National Engineering Research Center for Agro-Ecological Big Data Analysis and Application, School of Internet and Institutes of Physical Science and Information Technology, Anhui University, Hefei, 230601, Anhui, China.
School of Computer Science and Technology, Anhui University, Hefei, 230601, Anhui, China.
Amino Acids. 2022 May;54(5):765-776. doi: 10.1007/s00726-022-03129-5. Epub 2022 Jan 30.
Protein hot spot residues are functional sites in protein-protein interactions. Hot spot residues have traditionally been identified by biological experiments, which are laborious and time-consuming, so a variety of computational methods have been widely used in recent years. Despite their success in hot spot identification, most of these methods are impractical because they can recognize hot spot residues only among known protein-protein interface residues. Identifying hot spots from the whole protein sequence is therefore a meaningful and interesting problem. However, it introduces an extreme imbalance between positive and negative samples: hot spot residues account for only about 1-2% of whole protein sequences. To address this issue, this paper proposes a two-step ensemble model for identifying hot spot residues from extremely unbalanced data sets. The model is composed of 134 classifiers built from KNN and SVM base classifiers. Compared with previous methods, our model performs well, with an F1 score of 0.593 on the BID test set. Furthermore, to validate the robustness of our model, it was tested on three other independent test sets, where it also achieved good predictions. More importantly, the performance of our model on unbalanced data is comparable to that of other methods tested on balanced hot spot data sets.
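To illustrate why the F1 score is a more informative measure than accuracy when positives make up only about 1-2% of the samples, the following minimal sketch compares the two metrics on toy labels. The data, the two predictors, and the helper names `f1_score` and `accuracy` are hypothetical and chosen only for illustration; they are not taken from the paper and do not reproduce its model or results.

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical toy data: 1000 residues, of which 20 (2%) are hot spots.
y_true = [1] * 20 + [0] * 980

# Predictor A (hypothetical): labels every residue as a non-hot spot.
all_negative = [0] * 1000

# Predictor B (hypothetical): finds 12 of the 20 hot spots
# and raises 8 false alarms among the 980 negatives.
some_hits = [1] * 12 + [0] * 8 + [1] * 8 + [0] * 972

for name, y_pred in [("all-negative", all_negative), ("some-hits", some_hits)]:
    acc = accuracy(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    print(f"{name}: accuracy={acc:.3f}, F1={f1:.3f}")
```

On these toy labels both predictors reach about 98% accuracy, yet the all-negative predictor has an F1 of zero because it never finds a hot spot. This is the reason a positive-class metric such as F1 is the natural choice for reporting results on data this unbalanced.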