PIT Bioinformatics Group, Eötvös University, H-1117 Budapest, Hungary.
PIT Bioinformatics Group, Eötvös University, H-1117 Budapest, Hungary; Uratim Ltd., H-1118 Budapest, Hungary.
Methods. 2018 Jan 1;132:50-56. doi: 10.1016/j.ymeth.2017.06.034. Epub 2017 Jul 3.
Biological sequences can be regarded as high-dimensional data items whose dimension is not fixed but corresponds to the length of the sequence. Comparing and classifying biological sequences against large databases is an important area of current research. Artificial neural networks (ANNs) have gained well-deserved popularity among machine learning tools following their recent successful applications to image and sound processing and to classification problems. ANNs have also been applied to predict the family or function of a protein from its residue sequence. Here we present two new ANNs with multi-label classification ability, which achieve high accuracy when classifying protein sequences into 698 UniProt families (AUC=99.99%) and 983 Gene Ontology classes (AUC=99.45%).
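The abstract does not say how variable-length sequences are presented to the networks or how the multi-label targets are encoded. As a minimal sketch of one common approach, assuming one-hot encoding with zero-padding to a fixed length and a binary indicator vector per sequence, the idea could look like the following. The names `one_hot_encode`, `multi_label_target`, `MAX_LEN`, and the example class labels are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: the encoding scheme, MAX_LEN, and the class
# names below are assumptions, not the method described in the paper.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
MAX_LEN = 8  # assumed fixed input length; a real model would use a larger value


def one_hot_encode(sequence, max_len=MAX_LEN):
    """Truncate or zero-pad a residue sequence to max_len rows of 20 columns."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for residue in sequence[:max_len]:
        row = [0] * len(AMINO_ACIDS)
        if residue in index:  # non-standard residues stay all-zero
            row[index[residue]] = 1
        rows.append(row)
    # Pad short sequences with all-zero rows so every input has the same shape.
    rows.extend([0] * len(AMINO_ACIDS) for _ in range(max_len - len(rows)))
    return rows


def multi_label_target(labels, classes):
    """Binary indicator vector: one position per class, several may be 1."""
    return [1 if c in labels else 0 for c in classes]


# Hypothetical example: a three-residue sequence with two of three labels.
encoded = one_hot_encode("MKV")
target = multi_label_target({"class_a", "class_b"}, ["class_a", "class_b", "class_c"])
```

In a multi-label setting each output position is scored independently, so a sequence may belong to several classes at once, unlike single-label classification where exactly one class is chosen.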