School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, Hubei, China.
Hubei Province Key Laboratory of Intelligent Information Processing and Real-Time Industrial System, Wuhan, 430065, Hubei, China.
BMC Bioinformatics. 2021 Oct 25;22(Suppl 3):522. doi: 10.1186/s12859-021-04420-0.
Recognizing hot regions in protein-protein interactions is crucial when designing drugs and proteins. Each hot region of a protein-protein interaction is composed of at least three hot spots, which play an important role in binding. However, identifying hot spots through biological experiments is time-consuming and labor-intensive. If predictive models based on machine learning methods can be trained, the drug design process can be effectively accelerated.
The results show that different machine learning algorithms perform similarly when evaluated using the F-measure. The main differences between these methods lie in recall and precision. Because the key attribute of hot regions is that they are tightly packed, we used a clustering algorithm to predict hot regions. By combining Gaussian Naïve Bayes and DBSCAN, the F-measure of hot region prediction reaches 0.809.
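As a brief illustration of how two classifiers can score the same F-measure while trading recall against precision, the sketch below computes both quantities from raw counts. It assumes the balanced F1 form of the F-measure, and the counts are hypothetical values chosen for illustration, not results from this study.

```python
def f_measure(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts on the same set of 120 true hot spots (illustrative only).
recall_oriented = f_measure(tp=96, fp=64, fn=24)     # finds more hot spots, more false alarms
precision_oriented = f_measure(tp=72, fp=18, fn=48)  # fewer false alarms, misses more hot spots

for name, (p, r, f1) in [("recall-oriented", recall_oriented),
                         ("precision-oriented", precision_oriented)]:
    print(f"{name}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Both hypothetical classifiers end up with the same F1 even though one favors recall and the other precision, which is the kind of difference a single F-measure value hides.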
In this paper, different machine learning models such as Gaussian Naïve Bayes, SVM, XGBoost, Random Forest, and Artificial Neural Network are used to predict hot spots. The experimental results show that combining a hot spot classification algorithm that has a higher recall rate with a clustering algorithm that has higher precision can effectively improve the accuracy of hot region prediction.
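The clustering stage of such a combination can be sketched as follows: residues that a classifier has already labeled as hot spots are grouped by spatial density, and groups with at least three members are reported as hot regions. This is a minimal plain-Python DBSCAN written for illustration, not the authors' implementation; the residue names, coordinates, `eps` radius, and the `hot_regions` helper are all hypothetical assumptions.

```python
from math import dist


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN: return one cluster label per point, with -1 meaning noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:  # j is a core point, so keep expanding
                queue.extend(reachable)
    return labels


def hot_regions(residues, coords, eps=5.0, min_samples=3):
    """Group predicted hot spots into hot regions of at least three residues."""
    labels = dbscan(coords, eps, min_samples)
    regions = {}
    for name, label in zip(residues, labels):
        if label != -1:
            regions.setdefault(label, []).append(name)
    return [members for members in regions.values() if len(members) >= 3]


# Hypothetical residues predicted as hot spots, with made-up 3D coordinates.
residues = ["R1", "R2", "R3", "R4", "R5", "R6", "R7"]
coords = [(0, 0, 0), (3, 0, 0), (0, 3, 0), (3, 3, 1),  # tightly packed group
          (30, 30, 30),                                # isolated residue
          (50, 0, 0), (52, 0, 0)]                      # pair, too small for a hot region

regions = hot_regions(residues, coords)
print(regions)
```

In this toy input only the four tightly packed residues form a hot region; the isolated residue and the pair are treated as noise, which reflects how a density-based step can discard scattered false positives from a recall-oriented classifier.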