College of Computer and Information Science, Southwest University, Chongqing 400715, China.
Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China.
Bioinformatics. 2016 Oct 1;32(19):2996-3004. doi: 10.1093/bioinformatics/btw366. Epub 2016 Jun 17.
Predicting the biological functions of proteins is one of the key challenges in the post-genomic era. Computational models have demonstrated the utility of applying machine learning methods to predict protein function. Most prediction methods explicitly require a set of negative examples, that is, proteins known not to carry out a particular function. However, Gene Ontology (GO) almost always records only that a protein carries out a particular function, and the functional annotations of proteins are incomplete. GO organizes tens of thousands of terms in a structured hierarchy, and a protein is typically annotated with several (or dozens) of these terms. For these reasons, negative examples of a protein can greatly help distinguish its true positive examples within such a large candidate GO space.
In this paper, we present a novel approach (called NegGOA) to select negative examples. Specifically, NegGOA uses the ontology structure, the available annotations of a protein and the potential for additional annotations of that protein to choose its negative examples. We compare NegGOA with other negative example selection algorithms and find that NegGOA produces far fewer false negatives than they do. We incorporate the selected negative examples into an efficient function prediction model to predict the functions of proteins in Yeast, Human, Mouse and Fly. NegGOA also achieves higher accuracy than the compared algorithms across various evaluation metrics. In addition, NegGOA suffers less from incomplete protein annotations than the compared methods.
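The idea of using the ontology structure together with a protein's existing annotations can be illustrated with a deliberately simplified sketch. This is not the NegGOA algorithm, whose scoring is not described in this abstract; it is a plain heuristic under stated assumptions. Terms implied by the GO true path rule (ancestors of an annotated term) are treated as positives, and descendants of an annotated term are held back because they may later be added as more specific annotations. The toy ontology, the term names and the function names `ancestors` and `candidate_negatives` are all hypothetical.

```python
from typing import Dict, Set


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every ancestor of `term` in a DAG given as child -> parents."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def candidate_negatives(annotated: Set[str],
                        parents: Dict[str, Set[str]]) -> Set[str]:
    """Simplified heuristic (an assumption, not NegGOA itself).

    `annotated` holds the direct (most specific) annotations of one protein.
    """
    all_terms = set(parents) | {p for ps in parents.values() for p in ps}

    # True path rule: an annotation to a term implies all of its ancestors.
    positives = set(annotated)
    for term in annotated:
        positives |= ancestors(term, parents)

    # Descendants of an annotated term could be future, more specific
    # annotations, so they are not safe to use as negatives.
    possible_future = {t for t in all_terms
                       if ancestors(t, parents) & annotated}

    return all_terms - positives - possible_future


# Hypothetical toy ontology (child -> parents), not real GO identifiers.
toy_parents = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "dna_binding": {"binding"},
    "rna_binding": {"binding"},
    "kinase": {"catalysis"},
}

print(sorted(candidate_negatives({"binding"}, toy_parents)))
# ['catalysis', 'kinase']
```

In this toy case a protein annotated only with `binding` keeps `dna_binding` and `rna_binding` out of its negative set, since either could be a missing annotation. A method that handles incomplete annotations would need a finer-grained estimate of which unannotated terms are likely to be added later.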
The Matlab and R code is available at https://sites.google.com/site/guoxian85/neggoa
Supplementary data are available at Bioinformatics online.