Use of tree-based machine learning methods to screen affinitive peptides based on docking data.

Author information

Feng Hua, Wang Fangyu, Li Ning, Xu Qian, Zheng Guanming, Sun Xuefeng, Hu Man, Li Xuewu, Xing Guangxu, Zhang Gaiping

Affiliations

Henan Key Laboratory of Animal Immunology, Henan Academy of Agricultural Sciences, Zhengzhou, China.

College of Food Science and Technology, Henan Agricultural University, Zhengzhou, China.

Publication information

Mol Inform. 2023 Dec;42(12):e202300143. doi: 10.1002/minf.202300143. Epub 2023 Nov 9.

DOI:10.1002/minf.202300143

PMID:37696773

Abstract

Screening peptides with good affinity is an important step in peptide-drug discovery. Recent advances in computer and data science have made machine learning a useful tool for accurate affinitive-peptide screening. In the current study, four tree-based algorithms, namely classification and regression trees (CART), the C5.0 decision tree (C50), bagged CART (BAG) and random forest (RF), were employed to explore the relationship between experimental peptide affinities and virtual docking data, and the performance of the models was compared in parallel. All four algorithms performed better on the dataset pre-processed by scaling, centering and PCA than on the datasets pre-processed in other ways. After model rebuilding and hyperparameter optimization, the optimal C50 model (C50O) showed the best performance in terms of Accuracy, Kappa, Sensitivity, Specificity, F1, MCC and AUC when validated on the test data and on an unknown PEDV dataset (Accuracy = 80.4%). BAG and RFO (the optimal RF), the two best models during training, did not perform as expected in the test-data and unknown-dataset validations. Nevertheless, the high correlation of the RFO and BAG predictions with those of C50O implied high stability and robustness of their predictions. In contrast, despite its good performance on the unknown dataset, the poor performance of CARTO in test-data validation and correlation analysis indicated that it could not be used for future data prediction. To evaluate peptide affinity accurately, the current study is the first to stage a competition among tree models for affinitive-peptide prediction using virtual docking data, which should expand the application of machine learning algorithms in studying peptide-protein interactions (PepPIs) and benefit the development of peptide therapeutics.
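Most of the evaluation metrics named in the abstract (Accuracy, Kappa, Sensitivity, Specificity, F1 and MCC) can be derived from a binary confusion matrix. The following is a minimal sketch of those standard definitions, not the authors' code: the function name `binary_metrics` and the example counts are hypothetical, and AUC is left out because it requires ranked prediction scores rather than counts.

```python
import math


def _safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion-matrix counts.

    tp, fp, tn, fn are the numbers of true positives, false positives,
    true negatives and false negatives (hypothetical inputs for illustration).
    """
    n = tp + fp + tn + fn
    accuracy = _safe_div(tp + tn, n)
    sensitivity = _safe_div(tp, tp + fn)  # recall on the positive class
    specificity = _safe_div(tn, tn + fp)  # recall on the negative class
    precision = _safe_div(tp, tp + fp)
    f1 = _safe_div(2 * precision * sensitivity, precision + sensitivity)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_chance = _safe_div((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp), n * n)
    kappa = _safe_div(accuracy - p_chance, 1 - p_chance)

    # Matthews correlation coefficient.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = _safe_div(tp * tn - fp * fn, mcc_den)

    return {
        "accuracy": accuracy,
        "kappa": kappa,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts, not data from the study.
    for name, value in binary_metrics(tp=40, fp=10, tn=40, fn=10).items():
        print(f"{name}: {value:.3f}")
```

With the hypothetical balanced counts above, accuracy, sensitivity, specificity and F1 all come out to 0.8, while kappa and MCC come out to 0.6 (up to floating-point rounding), which illustrates how the chance-corrected metrics are stricter than raw accuracy.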
