Chen Xue-wen, Jeong Jong Cheol
Bioinformatics and Computational Life Sciences Laboratory, Information and Telecommunication Technology Center, University of Kansas, Lawrence, KS 66045, USA.
Bioinformatics. 2009 Mar 1;25(5):585-91. doi: 10.1093/bioinformatics/btp039. Epub 2009 Jan 19.
Identification of protein interaction sites has a significant impact on understanding protein function, elucidating signal transduction networks and drug design studies. With protein sequence data growing exponentially, methods that predict protein interaction sites from sequence information alone have drawn increasing interest. In this article, we propose a predictive model for identifying protein interaction sites. Without using any structure data, the proposed method extracts a wide range of features from protein sequences. A random forest-based integrative model is developed to utilize these features effectively and to deal with the imbalanced data classification problem commonly encountered in binding site prediction.
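As an illustration of what sequence-only feature extraction can look like, the sketch below builds a fixed-width window around each residue and turns it into an amino-acid composition vector. The abstract does not specify which features are used, so the windowing scheme, the composition features, the padding symbol and the names `residue_windows` and `composition_features` are all assumptions made for this example, not the authors' method.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD = "-"  # hypothetical padding symbol for positions past the chain ends


def residue_windows(sequence, half_width=4):
    """Return one window per residue, centred on that residue.

    Windows near the termini are padded so every window has the
    same length (2 * half_width + 1).
    """
    padded = PAD * half_width + sequence + PAD * half_width
    size = 2 * half_width + 1
    return [padded[i:i + size] for i in range(len(sequence))]


def composition_features(window):
    """Fraction of each standard amino acid in the window (20 values)."""
    counts = Counter(window)
    return [counts[aa] / len(window) for aa in AMINO_ACIDS]


# Toy sequence, purely illustrative (not taken from the paper's dataset).
windows = residue_windows("MKTAYIAK", half_width=2)
features = [composition_features(w) for w in windows]
```

Each row of `features` could then serve as the input vector for a per-residue classifier such as a random forest; richer descriptors would be appended to the same row.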
We evaluate the predictive method using 2,829 interface residues and 24,616 non-interface residues extracted from 99 polypeptide chains in the Protein Data Bank. The experimental results show that the proposed method performs significantly better than two other sequence-based predictive methods and can reliably predict residues involved in protein interaction sites. Furthermore, we apply the method to predict interaction sites and to construct three protein complexes: the DnaK molecular chaperone system, 1YUW and 1DKG, which provide new insight into the sequence-function relationship. We show that the predicted interaction sites can be valuable as a first approach for guiding experimental methods investigating protein-protein interactions and localizing the specific interface residues.
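The residue counts above make the class imbalance concrete: interface residues are roughly one in ten, so a classifier that always answers "non-interface" would already be about 89.7% accurate. The sketch below computes these figures and shows one common way to balance training data, undersampling the majority class. The abstract does not say how imbalance is handled, so `balanced_subsample` is a hypothetical helper and the undersampling strategy is an assumption.

```python
import random

N_INTERFACE = 2829        # interface residues reported in the abstract
N_NON_INTERFACE = 24616   # non-interface residues reported in the abstract

total = N_INTERFACE + N_NON_INTERFACE
positive_fraction = N_INTERFACE / total            # share of interface residues
imbalance_ratio = N_NON_INTERFACE / N_INTERFACE    # negatives per positive


def balanced_subsample(positives, negatives, rng):
    """Draw equally many items from each class, without replacement.

    Assumed strategy for illustration: each tree (or each model in an
    ensemble) could be trained on a different balanced draw.
    """
    k = min(len(positives), len(negatives))
    return rng.sample(positives, k), rng.sample(negatives, k)


rng = random.Random(0)
pos_ids = list(range(N_INTERFACE))         # stand-in indices for interface residues
neg_ids = list(range(N_INTERFACE, total))  # stand-in indices for non-interface residues
pos_sample, neg_sample = balanced_subsample(pos_ids, neg_ids, rng)

print(f"{positive_fraction:.3f} {imbalance_ratio:.2f}")  # prints "0.103 8.70"
```

Because plain accuracy is dominated by the majority class here, per-class measures such as sensitivity and specificity are more informative for comparing predictors on data like this.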
Datasets and software are available at http://ittc.ku.edu/~xwchen/bindingsite/prediction.