Ding Yijie, Tang Jijun, Guo Fei
School of Computer Science and Technology, Tianjin University, Tianjin 300350, China.
Department of Computer Science and Engineering, University of South Carolina, Columbia, SC 29208, USA.
Int J Mol Sci. 2016 Sep 24;17(10):1623. doi: 10.3390/ijms17101623.
Identification of protein-protein interactions (PPIs) is a difficult and important problem in biology. Since experimental methods for identifying PPIs are both expensive and time-consuming, many computational methods have been developed to predict PPIs and interaction networks, which can complement experimental approaches. However, these methods have limitations: they require a large number of homologous proteins or a large body of literature to be applicable. In this paper, we propose a novel matrix-based protein sequence representation approach to predict PPIs, using an ensemble learning method for classification. We construct the Amino Acid Contact (AAC) matrix from a statistical analysis of residue-pairing frequencies in a database of 6323 protein-protein complexes. We first represent each protein sequence as a Substitution Matrix Representation (SMR) matrix. Then, a feature vector is extracted by applying the Histogram of Oriented Gradient (HOG) and Singular Value Decomposition (SVD) algorithms to the SMR matrix. Finally, we feed the feature vector into a Random Forest (RF) classifier to distinguish interacting pairs from non-interacting pairs. We evaluate the method on several PPI datasets. On the S. cerevisiae dataset, our method achieves 94.83% accuracy and 92.40% sensitivity, an accuracy increase of 0.11 percentage points over existing methods. On the H. pylori dataset, it achieves 89.06% accuracy and 88.15% sensitivity, an accuracy increase of 0.76%. On the Human PPI dataset, it achieves 97.60% accuracy and 96.37% sensitivity, an accuracy increase of 1.30%. In addition, we test our method on a very important PPI network, where it achieves 92.71% accuracy. On the Wnt-related network, the accuracy of our method is increased by 16.67%.
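The feature-extraction steps described above (SMR matrix, then HOG and SVD features) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation and rests on several assumptions: the SMR matrix is taken to be an N x 20 matrix whose row i is the substitution-matrix row of residue i; the AAC matrix is replaced by a random symmetric placeholder, because its values are not given here; the HOG step is a simplified variant with hypothetical grid and bin settings; and the two proteins' features are simply concatenated. All function names are illustrative. The Random Forest step is omitted; the resulting vector could be passed to any Random Forest implementation.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def placeholder_contact_matrix(seed=0):
    """Stand-in for the 20x20 AAC matrix. Values are random, NOT the paper's."""
    rng = np.random.default_rng(seed)
    m = rng.normal(size=(20, 20))
    return (m + m.T) / 2.0  # symmetric, like a residue-pairing matrix

def smr_matrix(sequence, sub_matrix):
    """Assumed SMR form: row i is the substitution-matrix row of residue i (N x 20)."""
    rows = [AA_INDEX[aa] for aa in sequence]
    return sub_matrix[rows, :]

def hog_features(matrix, grid=(2, 2), n_bins=9):
    """Simplified HOG: magnitude-weighted orientation histograms per grid cell."""
    gy, gx = np.gradient(matrix.astype(float))
    magnitude = np.hypot(gx, gy)
    orientation = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned angle in [0, pi]
    features = []
    ori_rows = np.array_split(orientation, grid[0], axis=0)
    mag_rows = np.array_split(magnitude, grid[0], axis=0)
    for ori_band, mag_band in zip(ori_rows, mag_rows):
        ori_cells = np.array_split(ori_band, grid[1], axis=1)
        mag_cells = np.array_split(mag_band, grid[1], axis=1)
        for ori_cell, mag_cell in zip(ori_cells, mag_cells):
            hist, _ = np.histogram(ori_cell, bins=n_bins,
                                   range=(0.0, np.pi), weights=mag_cell)
            norm = np.linalg.norm(hist)
            features.append(hist / norm if norm > 0 else hist)
    return np.concatenate(features)

def svd_features(matrix, k=20):
    """Singular values of the matrix, zero-padded or truncated to length k."""
    s = np.linalg.svd(matrix, compute_uv=False)
    out = np.zeros(k)
    out[: min(k, s.size)] = s[:k]
    return out

def pair_features(seq_a, seq_b, sub_matrix):
    """Concatenate HOG and SVD features of both proteins (an assumed pairing scheme)."""
    parts = []
    for seq in (seq_a, seq_b):
        smr = smr_matrix(seq, sub_matrix)
        parts.append(hog_features(smr))
        parts.append(svd_features(smr))
    return np.concatenate(parts)

M = placeholder_contact_matrix()
vec = pair_features("MKTAYIAKQRQISFVKSHFSRQ", "GSHMLEDPVACDEFGHIK", M)
print(vec.shape)  # (112,)
```

With the hypothetical settings above, each protein contributes 2 x 2 x 9 = 36 HOG values and 20 singular values, so a pair yields a 112-dimensional vector regardless of sequence length. A fixed-length vector is what allows proteins of different lengths to be fed to one classifier.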
The source code and all datasets are available at https://figshare.com/s/580c11dce13e63cb9a53.