Khan Ishita K, Kihara Daisuke
Department of Computer Science and Department of Biological Science, Purdue University, West Lafayette, IN, USA.
Bioinformatics. 2016 Aug 1;32(15):2281-8. doi: 10.1093/bioinformatics/btw166. Epub 2016 Mar 26.
Moonlighting proteins (MPs) perform multiple cellular functions within a single polypeptide chain. To understand the overall landscape of their functional diversity, it is important to establish a computational method that can identify MPs on a genome scale. Previously, we systematically characterized MPs using functional and omics-scale information. In this work, we develop a computational prediction model for the automatic identification of MPs using a diverse range of protein association information.
We incorporated a diverse range of protein association information to extract characteristic features of MPs, ranging from gene ontology (GO), protein-protein interactions, gene expression, phylogenetic profiles, genetic interactions and network-based graph properties to protein structural properties, specifically intrinsically disordered regions in the protein chain. We then trained machine learning classifiers on this broad feature space to predict MPs. Because many known MPs lack some proteomic features, we developed an imputation technique to fill in the missing features. Results on the control dataset show that MPs can be predicted with over 98% accuracy when GO terms are available. Furthermore, using only the omics-based features, the method can still identify MPs with over 75% accuracy. Finally, we applied the method to three genomes, Saccharomyces cerevisiae, Caenorhabditis elegans and Homo sapiens, and found that about 2-10% of the proteins in these genomes are potential MPs.
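The general pattern described above, filling in missing features and then classifying on the completed feature vectors, can be illustrated with a minimal sketch. The abstract does not specify the imputation technique or the classifiers used, so this example substitutes plain column-mean imputation and a k-nearest-neighbour vote as stand-ins. The feature names (interaction degree, expression correlation, disorder fraction), the toy values and the function names are all hypothetical and are not taken from the paper or its code.

```python
from math import dist
from statistics import mean

# Hypothetical toy features per protein (NOT data from the paper):
# [interaction degree, expression correlation, disorder fraction]
# None marks a feature that is unavailable for that protein.
train_raw = [
    [12.0, 0.30, None],
    [15.0, None, 0.40],
    [14.0, 0.35, 0.45],
    [3.0, 0.80, 0.10],
    [2.0, None, 0.05],
    [4.0, 0.75, None],
]
train_labels = [1, 1, 1, 0, 0, 0]  # 1 = MP, 0 = non-MP (made-up labels)


def column_means(rows):
    """Mean of the observed (non-None) values in each feature column."""
    means = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] is not None]
        # Fall back to 0.0 if a feature is missing for every protein.
        means.append(mean(observed) if observed else 0.0)
    return means


def fill_missing(row, means):
    """Replace None entries in one feature vector with the column means."""
    return [means[j] if value is None else value for j, value in enumerate(row)]


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training vectors closest to the query."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: dist(pair[0], query))
    votes = sum(label for _, label in ranked[:k])
    return 1 if 2 * votes > k else 0


# Means are computed on the training set only and reused for new proteins.
means = column_means(train_raw)
train_filled = [fill_missing(row, means) for row in train_raw]

query = fill_missing([13.0, None, 0.50], means)
prediction = knn_predict(train_filled, train_labels, query)
print("predicted MP" if prediction == 1 else "predicted non-MP")
```

The features here are left unscaled for brevity, so the distance is dominated by the largest-valued column; a real pipeline would normally standardise the features first.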
Code available at http://kiharalab.org/MPprediction
Supplementary data are available at Bioinformatics online.