Wang Lei, You Zhu-Hong, Xia Shi-Xiong, Liu Feng, Chen Xing, Yan Xin, Zhou Yong
School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China; College of Information Science and Engineering, Zaozhuang University, Zaozhuang, Shandong 277100, China.
The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China.
J Theor Biol. 2017 Apr 7;418:105-110. doi: 10.1016/j.jtbi.2017.01.003. Epub 2017 Jan 11.
Protein-protein interactions (PPIs) are essential to most biological processes and play a critical role in most cellular functions. With the development of high-throughput biological techniques and in silico methods, a large amount of PPI data has been generated for various organisms, but many problems remain unsolved. This has motivated the development of machine-learning-based in silico methods for predicting PPIs. In this study, we propose a novel method that combines an ensemble Rotation Forest (RF) classifier with the Discrete Cosine Transform (DCT) algorithm to predict interactions among proteins. Specifically, the amino acid sequence of each protein is transformed into a Position-Specific Scoring Matrix (PSSM) containing biological evolutionary information; a feature vector representing this evolutionary information is then extracted from the PSSM using the DCT algorithm; finally, the ensemble rotation forest model is used to predict whether a given protein pair interacts. On the Yeast and H. pylori data sets, the proposed method achieved average accuracies of 98.54% and 88.27%, respectively. In addition, it achieved prediction accuracies of 98.08%, 92.75%, 98.87% and 98.72% on the independent data sets C. elegans, E. coli, H. sapiens and M. musculus, respectively. To further evaluate the performance of our method, we compared it with the state-of-the-art Support Vector Machine (SVM) classifier and obtained good results. The source code and Yeast data sets used in this article are freely available through a web server at http://202.119.201.126:8888/DCTRF/.
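The feature-extraction step described above (PSSM to DCT feature vector) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the orthonormal DCT-II scaling, the choice to keep a 4 x 4 block of low-frequency coefficients, the zero-padding for short proteins, and the toy matrices standing in for real PSSMs are all assumptions made for illustration. It uses only the Python standard library.

```python
import math


def dct_1d(values):
    """Orthonormal DCT-II of a sequence of numbers."""
    n = len(values)
    out = []
    for k in range(n):
        total = sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, v in enumerate(values))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * total)
    return out


def dct_2d(matrix):
    """Separable 2D DCT: transform every row, then every column."""
    rows = [dct_1d(row) for row in matrix]
    cols = [dct_1d(col) for col in zip(*rows)]  # cols[j][i] is coefficient (i, j)
    return [list(r) for r in zip(*cols)]        # transpose back to (i, j) order


def dct_features(pssm, keep_rows=4, keep_cols=4):
    """Flatten the low-frequency (top-left) block of the 2D DCT.

    keep_rows and keep_cols are hypothetical settings; the output length is
    fixed regardless of protein length, and missing entries are zero-padded.
    """
    coeffs = dct_2d(pssm)
    features = []
    for i in range(keep_rows):
        for j in range(keep_cols):
            inside = i < len(coeffs) and j < len(coeffs[i])
            features.append(coeffs[i][j] if inside else 0.0)
    return features


def pair_features(pssm_a, pssm_b, keep_rows=4, keep_cols=4):
    """Concatenate the DCT features of both proteins in a candidate pair."""
    return (dct_features(pssm_a, keep_rows, keep_cols)
            + dct_features(pssm_b, keep_rows, keep_cols))


def toy_pssm(length):
    """Toy stand-in for a real PSSM: `length` residues by 20 amino acids."""
    return [[(i * 7 + j * 3) % 11 - 5 for j in range(20)] for i in range(length)]


if __name__ == "__main__":
    vector = pair_features(toy_pssm(30), toy_pssm(45))
    print(len(vector))  # one fixed-length vector per protein pair
```

Because the DCT concentrates most of a signal's energy in its low-frequency coefficients, truncating to a small block gives proteins of different lengths feature vectors of the same size. In the pipeline described above, vectors like these would then be passed to the classifier (a rotation forest in the paper) to label each pair as interacting or non-interacting.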