Wei Zhi-Sen, Yang Jing-Yu, Shen Hong-Bin, Yu Dong-Jun
IEEE Trans Nanobioscience. 2015 Oct;14(7):746-60. doi: 10.1109/TNB.2015.2475359. Epub 2015 Sep 28.
Protein-protein interactions are ubiquitous and play important roles in the life cycles of living cells. The interaction sites (residues) are essential to understanding the mechanisms underlying protein-protein interactions. Previous research has demonstrated that accurately identifying protein-protein interaction sites (PPIs) helps in developing new therapeutic drugs, because many drugs interact directly with those residues. Because of its significant potential in biological research and drug development, the prediction of PPIs has become an important topic in computational biology. However, the PPIs prediction problem suffers from severe data imbalance: the majority class samples (non-interacting residues) far outnumber the minority class samples (interacting residues). We therefore developed a novel cascade random forests algorithm (CRF) to address this imbalance. The proposed CRF mitigates the negative effect of data imbalance by connecting multiple random forests in a cascade-like manner, each of which is trained, using an effective ensemble protocol, on a balanced training subset that includes all minority samples and a subset of the majority samples. Based on the proposed CRF, we implemented a new sequence-based PPIs predictor, called CRF-PPI, which takes the combined features of position-specific scoring matrices, averaged cumulative hydropathy, and predicted relative solvent accessibility as model inputs. Benchmark experiments on both cross-validation and independent validation datasets demonstrated that the proposed CRF-PPI outperformed state-of-the-art sequence-based PPIs predictors. The source code for CRF-PPI and the benchmark datasets are available online at http://csbio.njust.edu.cn/bioinf/CRF-PPI for free academic use.
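The balanced-subset idea described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' CRF implementation. The abstract does not spell out how the forests are wired into a cascade, so the sketch simply partitions the majority class into disjoint chunks, pairs each chunk with all minority samples, trains one learner per balanced subset, and averages the scores. To stay dependency-free it uses a nearest-centroid scorer as a stand-in for a random forest. The names `balanced_subsets`, `CentroidLearner`, `train_ensemble`, and `predict_score` are hypothetical.

```python
import random
from statistics import mean


def balanced_subsets(minority, majority, seed=0):
    """Pair ALL minority samples with disjoint, similarly sized majority chunks."""
    if not minority:
        raise ValueError("need at least one minority sample")
    rng = random.Random(seed)
    shuffled = list(majority)
    rng.shuffle(shuffled)
    # Number of chunks so that each chunk is roughly the minority size.
    k = max(1, len(shuffled) // len(minority))
    return [(minority, shuffled[i::k]) for i in range(k)]


def _centroid(rows):
    return [mean(col) for col in zip(*rows)]


def _dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


class CentroidLearner:
    """Stand-in base learner (assumption: NOT a random forest).

    Scores a sample in [0, 1]; higher means closer to the minority centroid.
    """

    def fit(self, pos, neg):
        self.cp, self.cn = _centroid(pos), _centroid(neg)
        return self

    def score(self, x):
        dp, dn = _dist(x, self.cp), _dist(x, self.cn)
        return dn / (dp + dn) if dp + dn else 0.5


def train_ensemble(minority, majority, make_learner=CentroidLearner, seed=0):
    """Train one learner per balanced subset."""
    return [
        make_learner().fit(pos, neg)
        for pos, neg in balanced_subsets(minority, majority, seed)
    ]


def predict_score(models, x):
    """Average the per-learner scores (simple ensemble rule, an assumption)."""
    return mean(m.score(x) for m in models)


# Toy data: 20 "interacting" vs 200 "non-interacting" residues, 2 features each.
rng = random.Random(1)
interacting = [[rng.gauss(2, 0.5), rng.gauss(2, 0.5)] for _ in range(20)]
non_interacting = [[rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for _ in range(200)]

models = train_ensemble(interacting, non_interacting)
print(len(models), "learners trained on balanced subsets")
print("score near the minority cluster:", round(predict_score(models, [2.0, 2.0]), 3))
```

Because every learner sees all minority samples but only a slice of the majority class, no single model is dominated by non-interacting residues, while the ensemble as a whole still uses every majority sample. Swapping `make_learner` for a real random forest wrapper would bring the sketch closer to the setting the abstract describes.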