Cao Renzhi, Cheng Jianlin
Computer Science Department, Informatics Institute, University of Missouri, Columbia, MO 65211, USA.
Methods. 2016 Jan 15;93:84-91. doi: 10.1016/j.ymeth.2015.09.011. Epub 2015 Sep 11.
Protein function prediction is an important and challenging problem in bioinformatics and computational biology. Functionally relevant biological information such as protein sequences, gene expression, and protein-protein interactions has mostly been used separately for protein function prediction. A major challenge is how to effectively integrate multiple sources of information, both traditional and new (such as spatial gene-gene interaction networks generated from chromosomal conformation data), to improve protein function prediction.
In this work, we developed three probabilistic scores (the MIS, SEQ, and NET scores) that combine protein sequence, function associations, protein-protein interaction networks, and spatial gene-gene interaction networks for protein function prediction. The MIS score is generated mainly from homologous proteins found by PSI-BLAST search, together with association rules between Gene Ontology terms that are learned by mining the Swiss-Prot database. The SEQ score is generated from protein sequences. The NET score is generated from protein-protein interaction and spatial gene-gene interaction networks. These three scores were combined in a new Statistical Multiple Integrative Scoring System (SMISS) to predict protein function. We tested SMISS on the data set of the 2011 Critical Assessment of Function Annotation (CAFA). According to the maximum F-measure, the method performed substantially better than three baseline methods and an advanced method based on protein profile-sequence comparison, profile-profile comparison, and domain co-occurrence networks.
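The abstract does not spell out how the maximum F-measure is computed. As a minimal sketch, assuming the protein-centric definition commonly used in CAFA-style evaluations, it could look like the following. The function name `max_f_measure`, the dictionary layout, and the 0.01 threshold step are illustrative assumptions, not details taken from the paper.

```python
from typing import Dict, Set


def max_f_measure(
    predictions: Dict[str, Dict[str, float]],
    truth: Dict[str, Set[str]],
    steps: int = 100,
) -> float:
    """Protein-centric maximum F-measure over score thresholds.

    predictions: protein id -> {GO term -> score in [0, 1]} (assumed layout)
    truth:       protein id -> non-empty set of true GO terms (assumed layout)
    """
    best = 0.0
    for i in range(1, steps + 1):
        threshold = i / steps
        precisions = []   # only proteins with at least one prediction
        recall_sum = 0.0  # summed over all benchmark proteins
        for protein, true_terms in truth.items():
            scored = predictions.get(protein, {})
            predicted = {t for t, s in scored.items() if s >= threshold}
            hits = len(predicted & true_terms)
            if predicted:
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue  # no protein has a prediction at this threshold
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(truth)
        if precision + recall > 0:
            f = 2 * precision * recall / (precision + recall)
            best = max(best, f)
    return best


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    toy_predictions = {
        "P1": {"GO:A": 0.9, "GO:B": 0.4},
        "P2": {"GO:C": 0.8},
    }
    toy_truth = {"P1": {"GO:A"}, "P2": {"GO:C", "GO:D"}}
    print(round(max_f_measure(toy_predictions, toy_truth), 3))
```

In this sketch, precision is averaged only over proteins that receive at least one prediction at the current threshold, while recall is averaged over every benchmark protein, so a method that predicts confidently for only a few proteins is penalized on recall.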