School of Computer Science and Shanghai Key Lab of Intelligent Information Processing.
Center for Computational System Biology, ISTBI, Fudan University, Shanghai, China.
Bioinformatics. 2018 Jul 15;34(14):2465-2473. doi: 10.1093/bioinformatics/bty130.
Gene Ontology (GO) is widely used to annotate protein functions and to understand their biological roles. Currently, fewer than 1% of the more than 70 million proteins in UniProtKB have experimental GO annotations, so automated function prediction (AFP) of proteins is strongly needed. AFP is a hard multilabel classification problem because a single protein can be annotated with many GO terms. For most of these proteins, the sequence is the only available input, which underlines the importance of sequence-based AFP (SAFP: sequences are the only input). Furthermore, homology-based SAFP tools are competitive in AFP competitions, but they do not necessarily work well for so-called difficult proteins, which have <60% sequence identity to proteins that are already annotated. The vital and challenging problem is therefore how to develop an SAFP method, particularly for difficult proteins.
The key to such a method is to extract not only homology information but also diverse, deep-rooted information/evidence from sequence inputs, and to integrate it into a predictor both effectively and efficiently. We propose GOLabeler, which integrates five component classifiers trained on different features (GO term frequency, sequence alignment, amino acid trigrams, domains and motifs, and biophysical properties) in the framework of learning to rank (LTR), a machine learning paradigm that is especially powerful for multilabel classification.
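As a rough illustration of two ideas mentioned above, the following Python sketch computes amino acid trigram frequencies for a sequence and ranks candidate GO terms from per-component scores. It is a minimal sketch under stated assumptions, not GOLabeler's implementation: the function names, the fixed linear weights, and the example scores are all hypothetical, and in a learning-to-rank setting the scoring function would be learned from training data rather than set by hand.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids give 20**3 = 8000 possible trigrams.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TRIGRAMS = ["".join(t) for t in product(AMINO_ACIDS, repeat=3)]
TRIGRAM_INDEX = {tri: i for i, tri in enumerate(TRIGRAMS)}


def trigram_frequencies(sequence):
    """Return a length-8000 list of relative trigram frequencies.

    Trigrams containing non-standard residues (e.g. 'X') are skipped.
    """
    counts = Counter(sequence[i:i + 3] for i in range(len(sequence) - 2))
    vec = [0.0] * len(TRIGRAMS)
    total = 0
    for tri, count in counts.items():
        idx = TRIGRAM_INDEX.get(tri)
        if idx is not None:
            vec[idx] = float(count)
            total += count
    if total:
        vec = [v / total for v in vec]
    return vec


def rank_go_terms(component_scores, weights):
    """Rank GO terms by a weighted sum of component classifier scores.

    component_scores: dict mapping GO term -> list of scores, one per component.
    weights: list of hypothetical per-component weights (an assumption here;
             a learning-to-rank model would learn the scoring function).
    """
    combined = {
        term: sum(w * s for w, s in zip(weights, scores))
        for term, scores in component_scores.items()
    }
    # Highest combined score first; ties broken by GO term ID for determinism.
    return sorted(combined, key=lambda term: (-combined[term], term))


if __name__ == "__main__":
    features = trigram_frequencies("ACDAC")
    print(len(features), sum(features))
    # Illustrative, made-up scores from two hypothetical components.
    scores = {"GO:0005524": [0.9, 0.8], "GO:0003677": [0.2, 0.1]}
    print(rank_go_terms(scores, weights=[0.5, 0.5]))
```

Each candidate GO term is represented by the scores it receives from the component classifiers, and the final output for a protein is an ordered list of GO terms rather than independent yes/no decisions, which is the sense in which ranking suits multilabel prediction.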
Extensive experiments on large-scale datasets revealed numerous favorable aspects of GOLabeler, including a significant performance advantage over state-of-the-art AFP methods.
http://datamining-iip.fudan.edu.cn/golabeler.
Supplementary data are available at Bioinformatics online.