Qin Chao, Sun Yongqi, Dong Yadong
Beijing Key Lab of Traffic Data Analysis and Mining, School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China.
PLoS One. 2017 Jul 28;12(7):e0182031. doi: 10.1371/journal.pone.0182031. eCollection 2017.
Essential proteins are proteins that are indispensable to the survival and development of an organism. Deleting a single essential protein causes lethality or infertility. Identifying and analysing essential proteins is key to understanding the molecular mechanisms of living cells. There are two types of methods for predicting essential proteins: experimental methods, which require considerable time and resources, and computational methods, which avoid those costs. However, the prediction accuracy of computational methods for essential proteins requires further improvement. In this paper, we propose a new computational strategy named CoTB for identifying essential proteins based on a combination of topological properties, subcellular localization information and orthologous protein information. First, we introduce several topological properties of the protein-protein interaction (PPI) network. Second, we propose new methods for measuring orthologous information and subcellular localization, together with a new computational strategy that uses a random forest prediction model to obtain, for each protein, a probability score of being essential. Finally, we conduct experiments on four different Saccharomyces cerevisiae datasets. The experimental results demonstrate that our strategy for identifying essential proteins outperforms traditional computational methods and the most recently developed method, SON. In particular, our strategy improves the prediction accuracy to 89%, 78%, 79% and 85% on the YDIP, YMIPS, YMBD and YHQ datasets, respectively, at the top 100 level.
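To make the evaluation described in the abstract concrete, the following is a minimal sketch of how a "top k" prediction accuracy can be computed from per-protein scores. It is not the CoTB method: the random forest, the orthology measure and the subcellular localization measure are not reproduced here. Degree centrality stands in as one example of a topological property (the abstract does not say which properties are used), and the protein names, edge list and essential set are hypothetical toy data. The assumed reading of "accuracy at the top 100 level" is the fraction of the 100 highest-scoring proteins that are known to be essential.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return {protein: number of distinct interaction partners}."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {protein: len(partners) for protein, partners in neighbors.items()}


def top_k_accuracy(scores, essential, k):
    """Fraction of the k highest-scoring proteins that are known essential."""
    # Sort by descending score; ties are broken by name so the ranking is stable.
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(p in essential for p in top) / len(top)


# Toy PPI network with hypothetical protein names (not real data).
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
essential = {"A", "C"}  # assumed ground-truth labels for the toy example

scores = degree_centrality(edges)
accuracy = top_k_accuracy(scores, essential, k=2)
print(scores, accuracy)
```

In a setting like the one the abstract describes, `scores` would instead hold the probability output by the trained classifier for each protein, and `k` would be 100; the ranking and counting steps would stay the same.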