Bioinformatics Research Center, School of Computer Engineering, Nanyang Technological University, 639798 Singapore.
BMC Bioinformatics. 2010 Jul 28;11:402. doi: 10.1186/1471-2105-11-402.
Protein-protein interactions play essential roles in protein function determination and drug design. Numerous methods have been proposed to recognize their interaction sites; however, only a small proportion of protein complexes have been successfully resolved because of the high cost. It is therefore important to improve the performance of methods that predict protein interaction sites from primary sequence alone.
We propose constructing an integrative profile for each residue in a protein by combining its hydrophobic and evolutionary information. A support vector machine (SVM) ensemble is then developed, in which the SVMs are trained on different pairs of positive (interface sites) and negative (non-interface sites) subsets. The subsets, of roughly equal size, are formed by grouping residues in order of their change in accessible surface area before and after complexation. A self-organizing map (SOM) technique is applied to group similar input vectors so that interface residues are identified more accurately. An ensemble of ten SVMs improves the Matthews correlation coefficient (MCC) by around 8% and F1 by around 9% over an ensemble of three SVMs. As expected, SVM ensembles consistently perform better than individual SVMs. In addition, the model built on the integrative profiles outperforms models based on the sequence profile or the hydropathy scale alone. Because our method uses a small number of features to encode the input vectors, our model is simpler, faster and more accurate than the existing methods.
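The subset construction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the (residue id, ASA change) tuple layout, the choice to pair the i-th positive subset with the i-th negative subset, and the majority-vote combination rule are all assumptions made for illustration, since the abstract does not specify them.

```python
def split_by_asa_change(residues, n_subsets):
    """Sort residues by ASA change and cut the ordered list into
    n_subsets contiguous groups of roughly equal size.

    residues: list of (residue_id, asa_change) tuples (hypothetical layout).
    """
    ordered = sorted(residues, key=lambda r: r[1])
    base, extra = divmod(len(ordered), n_subsets)
    subsets, start = [], 0
    for i in range(n_subsets):
        # The first `extra` subsets take one additional residue each.
        size = base + (1 if i < extra else 0)
        subsets.append(ordered[start:start + size])
        start += size
    return subsets


def pair_subsets(pos_subsets, neg_subsets):
    """Assumed pairing rule: i-th positive subset with i-th negative subset.
    Each pair would be the training set for one ensemble member."""
    return list(zip(pos_subsets, neg_subsets))


def majority_vote(predictions):
    """Assumed combination rule: label a residue as interface (1) only if
    more than half of the ensemble members predict 1; ties give 0."""
    return 1 if 2 * sum(predictions) > len(predictions) else 0


# Toy data: residue ids with made-up ASA changes (not from the paper).
positives = [("A12", 35.0), ("A40", 5.5), ("B7", 60.2), ("B19", 12.0),
             ("C3", 48.1), ("C8", 20.4), ("D1", 9.9)]
negatives = [("A1", 0.0), ("A2", 0.4), ("B3", 0.1), ("B4", 0.9),
             ("C5", 0.2), ("C6", 0.7)]

pos_subsets = split_by_asa_change(positives, 3)
neg_subsets = split_by_asa_change(negatives, 3)
training_pairs = pair_subsets(pos_subsets, neg_subsets)
```

In this sketch each element of `training_pairs` holds the positive and negative examples for one classifier, and `majority_vote` would be applied per residue to the labels returned by the trained members.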
The integrative profile, which combines hydrophobic and evolutionary information, contributes most to the prediction of protein-protein interaction sites. The results show that considering the evolutionary context of a residue with respect to hydrophobicity improves the identification of protein interface residues. In addition, the ensemble of SVM classifiers improves prediction performance.
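One way to picture an integrative profile is as a hydropathy value weighted by a residue's evolutionary profile. The sketch below is an assumption made for illustration, not the formula used in the paper: it takes a per-position amino acid frequency row (for example from a multiple sequence alignment), weights it by the Kyte-Doolittle hydropathy scale, and then encodes a residue by the values in a sliding window around it. The function names, the choice of scale, and the zero padding at the sequence ends are all hypothetical.

```python
# Kyte-Doolittle hydropathy scale (assumed here; the abstract only
# says "hydropathy scale" without naming one).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def integrative_value(profile_row):
    """Frequency-weighted mean hydropathy for one sequence position.

    profile_row: dict mapping amino acid letter -> observed frequency
    at this position (hypothetical input format).
    """
    total = sum(profile_row.values())
    if total == 0:
        return 0.0
    weighted = sum(KYTE_DOOLITTLE[aa] * freq for aa, freq in profile_row.items())
    return weighted / total


def window_features(values, center, half_width):
    """Encode a residue by the integrative values of itself and its
    sequence neighbours; positions past either end are padded with 0.0."""
    return [
        values[i] if 0 <= i < len(values) else 0.0
        for i in range(center - half_width, center + half_width + 1)
    ]


# Toy profile for a three-residue sequence (made-up frequencies).
profile = [
    {"I": 1.0},
    {"I": 0.5, "R": 0.5},
    {"G": 0.25, "A": 0.75},
]
values = [integrative_value(row) for row in profile]
features_for_first_residue = window_features(values, 0, 1)
```

With this encoding each residue is described by only `2 * half_width + 1` numbers, which is in the spirit of the small feature count the abstract mentions, although the actual feature definition may differ.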
Datasets and software are available at http://mail.ustc.edu.cn/~bigeagle/BMCBioinfo2010/index.htm.