Liu Quanya, Chen Peng, Wang Bing, Zhang Jun, Li Jinyan
Institute of Physical Science and Information Technology, Anhui University, Hefei, Anhui, 230601, China.
School of Electrical and Information Engineering, Anhui University of Technology, Ma'anshan, Anhui, 243032, China.
BMC Syst Biol. 2018 Dec 31;12(Suppl 9):132. doi: 10.1186/s12918-018-0665-8.
Hot spot residues are functional sites in protein interaction interfaces. Identifying them experimentally is time-consuming and laborious, so many computational methods have been developed to predict them. Most of these methods rely on structural features, sequence characteristics, and/or other protein features.
This paper proposes an ensemble learning method for predicting hot spot residues that uses only sequence features and the relative accessible surface area of amino acid sequences. In this work, a novel feature selection technique was developed; an auto-correlation function was combined with a sliding window technique to capture the characteristics of amino acid residues in a protein sequence; and an ensemble classifier with SVM and KNN base classifiers was built to achieve the best classification performance.
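As a rough illustration of how an auto-correlation function can be combined with a sliding window, the sketch below encodes each residue by the auto-covariance of a physicochemical property within a window centred on it. The abstract does not specify the properties, window size, lags, or exact formula, so the Kyte-Doolittle hydropathy scale, `window=7`, `max_lag=3`, and the function names here are all assumptions chosen for the example, not the authors' settings.

```python
from typing import List

# Kyte-Doolittle hydropathy scale; used only as an example property
# (assumption: the abstract does not say which properties were used).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def autocorrelation(values: List[float], lag: int) -> float:
    """Auto-covariance of a numeric series at the given lag.

    Returns 0.0 when the series is too short for the lag.
    """
    n = len(values)
    if lag >= n:
        return 0.0
    mean = sum(values) / n
    total = sum(
        (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)
    )
    return total / (n - lag)


def window_features(
    sequence: str, window: int = 7, max_lag: int = 3
) -> List[List[float]]:
    """One feature vector per residue: auto-covariance at lags 1..max_lag,
    computed over a window centred on that residue (truncated at the ends)."""
    values = [HYDROPATHY[aa] for aa in sequence]
    half = window // 2
    features = []
    for i in range(len(values)):
        segment = values[max(0, i - half): i + half + 1]
        features.append(
            [autocorrelation(segment, lag) for lag in range(1, max_lag + 1)]
        )
    return features


if __name__ == "__main__":
    # Hypothetical short peptide, for demonstration only.
    for residue, vector in zip("ACDEFGHIK", window_features("ACDEFGHIK")):
        print(residue, [round(v, 3) for v in vector])
```

Vectors of this kind, together with other per-residue descriptors, could then be passed to SVM and KNN base classifiers whose predictions are combined by an ensemble rule such as voting.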
The experimental results show that our model yields the highest F1 score of 0.92 and an MCC value of 0.87 on the ASEdb dataset. Compared with other machine learning methods, our model substantially improves hot spot prediction.
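For readers unfamiliar with the two reported metrics, the following minimal sketch computes the F1 score and the Matthews correlation coefficient (MCC) from binary labels using their standard definitions. The label convention (1 for hot spot, 0 for non-hot spot), the function names, and the toy labels are assumptions made for illustration; the sketch does not reproduce the paper's evaluation.

```python
import math
from typing import Sequence, Tuple


def confusion_counts(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[int, int, int, int]:
    """Return (tp, tn, fp, fn) for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)); 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Toy labels, not data from the paper.
    y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    print("F1:", f1_score(tp, fp, fn), "MCC:", mcc(tp, tn, fp, fn))
```

MCC uses all four cells of the confusion matrix, which makes it a useful complement to F1 when hot spots are much rarer than non-hot spots.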