IEEE/ACM Trans Comput Biol Bioinform. 2020 Sep-Oct;17(5):1525-1534. doi: 10.1109/TCBB.2019.2931717. Epub 2019 Jul 30.
Proteins are not isolated biological molecules: they have specific three-dimensional structures and interact with other proteins to perform their functions. A small number of residues (hot spots) in protein-protein interactions (PPIs) play a vital role in influencing and controlling biological processes. This paper uses boosting and gradient boosting algorithms, combined with two feature selection strategies, to classify hot spots on three common datasets and two hub protein datasets. First, correlation-based feature selection is used to remove highly correlated features and improve prediction accuracy. Then, recursive feature elimination based on a support vector machine (SVM-RFE) is adopted to select the optimal feature subset and improve training performance. Finally, boosting and gradient boosting (G-boosting) methods are invoked to generate the classification results. Gradient boosting can obtain a strong model by reducing the loss function along the gradient direction, which helps avoid overfitting. Five datasets from different protein databases are used to verify the models experimentally. The experimental results show that the proposed classification models perform competitively with existing classification methods.
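The first and last steps of the pipeline described above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. The SVM-RFE and plain boosting steps are omitted, and the 0.9 correlation threshold, the decision-stump base learner, the logistic loss, the learning rate, and the synthetic data are all assumptions made for illustration.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def drop_correlated(X, threshold=0.9):
    """Greedily keep a column only if |r| with every already-kept column is <= threshold."""
    cols = list(zip(*X))
    kept = []
    for j, col in enumerate(cols):
        if all(abs(pearson(col, cols[k])) <= threshold for k in kept):
            kept.append(j)
    return kept


def fit_stump(X, residuals):
    """Least-squares regression stump: returns (feature, threshold, left value, right value)."""
    best = None
    for j in range(len(X[0])):
        # Skip the largest value so that both sides of the split are non-empty.
        for t in sorted(set(row[j] for row in X))[:-1]:
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lv, rv)
    return best[1:]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(y, F):
    """Mean logistic loss for labels in {0, 1} and raw scores F."""
    return sum(math.log(1 + math.exp(-f if t == 1 else f)) for t, f in zip(y, F)) / len(y)


def gradient_boost(X, y, rounds=20, lr=0.5):
    """Each round fits a stump to the negative gradient of the log loss (y - p)."""
    base = min(max(sum(y) / len(y), 1e-6), 1 - 1e-6)
    F = [math.log(base / (1 - base))] * len(y)  # start from the log-odds of the base rate
    stumps, losses = [], [log_loss(y, F)]
    for _ in range(rounds):
        residuals = [t - sigmoid(f) for t, f in zip(y, F)]
        j, thr, lv, rv = fit_stump(X, residuals)
        F = [f + lr * (lv if row[j] <= thr else rv) for f, row in zip(F, X)]
        stumps.append((j, thr, lv, rv))
        losses.append(log_loss(y, F))
    return stumps, losses


# Synthetic stand-in for residue features: column 1 nearly duplicates column 0.
random.seed(0)
X, y = [], []
for _ in range(80):
    a, b = random.gauss(0, 1), random.gauss(0, 1)
    X.append([a, a + 0.01 * random.gauss(0, 1), b])
    y.append(1 if a + 0.5 * b + 0.3 * random.gauss(0, 1) > 0 else 0)

kept = drop_correlated(X)                       # indices of the retained features
X_kept = [[row[j] for j in kept] for row in X]
stumps, losses = gradient_boost(X_kept, y)
print(kept, round(losses[0], 3), round(losses[-1], 3))
```

The training loss recorded in `losses` shrinks from round to round, which is the "reducing the loss function in the gradient direction" behaviour the abstract refers to. In practice a library implementation, evaluated on held-out data, would be the more sensible choice.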