School of Computer Science and Information Engineering, Tianjin University of Science and Technology, Tianjin 300222, China.
Protein Eng Des Sel. 2012 Mar;25(3):119-26. doi: 10.1093/protein/gzr066. Epub 2012 Jan 18.
Predicting hot spots in protein interfaces provides crucial information for research on protein-protein interactions and for drug design. Existing machine learning methods generally judge whether a given residue is likely to be a hot spot by extracting features from the target residue alone. However, hot spots usually form small clusters of residues that are tightly packed together at the center of the protein interface. With this in mind, we present a novel method for extracting hybrid features that incorporate a wide range of information about the target residue and its spatially neighboring residues, i.e. the nearest contact residue on the other face (mirror-contact residue) and the nearest contact residue on the same face (intra-contact residue). We provide a novel random forest (RF) model that effectively integrates these hybrid features to predict hot spots in protein interfaces. Our method achieves an accuracy (ACC) of 82.4% and a Matthews correlation coefficient (MCC) of 0.482 on the Alanine Scanning Energetics Database, and an ACC of 77.6% and an MCC of 0.429 on the Binding Interface Database. In a comparison study, the performance of our RF model exceeds that of other existing methods, such as Robetta, FOLDEF, KFC, KFC2, MINERVA and HotPoint. Among our hybrid features, three physicochemical features of target residues (mass, polarizability and isoelectric point), together with the relative side-chain accessible surface area and the average depth index of mirror-contact residues, are found to be the main discriminative features in hot spot prediction. We also confirm that hot spots tend to form large contact surface areas between two interacting proteins. Source data and code are available at: http://www.aporc.org/doc/wiki/HotSpot.
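The hybrid-feature idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Residue` class, the use of a single representative coordinate per residue, Euclidean distance as the definition of "nearest", zero-padding when a neighbor is missing, and the toy feature values are all assumptions made for illustration only.

```python
import math
from dataclasses import dataclass


@dataclass
class Residue:
    """Hypothetical container for one interface residue."""
    name: str
    chain: str        # which face of the interface the residue belongs to
    coord: tuple      # one representative 3D coordinate (an assumption)
    features: list    # per-residue feature vector (toy values here)


def nearest(target, candidates):
    """Return the candidate closest to target, or None if there are none."""
    return min(
        candidates,
        key=lambda r: math.dist(target.coord, r.coord),
        default=None,
    )


def hybrid_features(target, interface):
    """Concatenate target, mirror-contact and intra-contact features."""
    other_face = [r for r in interface if r.chain != target.chain]
    same_face = [
        r for r in interface
        if r.chain == target.chain and r is not target
    ]
    mirror = nearest(target, other_face)   # nearest residue on the other face
    intra = nearest(target, same_face)     # nearest residue on the same face
    pad = [0.0] * len(target.features)     # assumed fallback for a missing neighbor
    return (
        list(target.features)
        + (list(mirror.features) if mirror else pad)
        + (list(intra.features) if intra else pad)
    )


# Toy interface with two residues on each face.
interface = [
    Residue("TRP", "A", (0.0, 0.0, 0.0), [1.0, 2.0]),
    Residue("ASP", "A", (3.0, 0.0, 0.0), [3.0, 4.0]),
    Residue("LYS", "B", (0.0, 2.0, 0.0), [5.0, 6.0]),
    Residue("GLY", "B", (0.0, 9.0, 0.0), [7.0, 8.0]),
]

vec = hybrid_features(interface[0], interface)
print(vec)  # [1.0, 2.0, 5.0, 6.0, 3.0, 4.0]
```

Vectors built this way, one per interface residue, could then be passed to any random forest implementation (for example scikit-learn's `RandomForestClassifier`) with hot spot labels as the target. Which library and which hyperparameters the authors used is not stated in this abstract.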