University of Rijeka, Faculty of Engineering, 51000 Rijeka, Croatia.
University of Rijeka, Department of Biotechnology, 51000 Rijeka, Croatia.
J Chem Inf Model. 2022 Jun 27;62(12):2961-2972. doi: 10.1021/acs.jcim.2c00526. Epub 2022 Jun 15.
The discovery of therapeutic peptides is often accelerated by virtual screening supported by machine learning-based predictive models. The predictive performance of such models is sensitive to the choice of data and its representation scheme. While peptide physicochemical and compositional representations fail to distinguish sequence permutations, representations based solely on the amino acid arrangement within the sequence lack the important information contained in physicochemical, conformational, topological, and geometrical properties. In this paper, we address this information gap by implementing a hybrid scheme that combines the strengths of both approaches, with the aim of predicting antimicrobial and antiviral activities based on experimental data from the DRAMP 2.0, AVPdb, and UniProt data repositories. Using the Friedman test of statistical significance, we compared our hybrid approach with peptide-property, one-hot vector encoding, and word embedding schemes in a 10-fold cross-validation setting, with respect to the F1 score, Matthews correlation coefficient, geometric mean, recall, and precision evaluation metrics. Moreover, a sequence modeling neural network was employed to gain insight into the synergistic effect of property-based and amino acid order-based predictions. The results suggest that the hybrid approach significantly (p < 0.01) surpasses the aforementioned state-of-the-art representation schemes. This makes it a strong candidate for increasing the predictive power of machine learning-based screening methods, applicable to any category of peptides.
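The information gap described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`composition`, `one_hot`, `hybrid`) and the two example peptides are hypothetical, and plain amino acid composition stands in here for the richer physicochemical property set used in the paper. The sketch only shows that a compositional vector is identical for two permuted sequences, whereas a one-hot encoding is not, and that a hybrid input can carry both views.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order shared by all encodings.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Fraction of each standard amino acid in seq (ignores residue order)."""
    counts = Counter(seq)
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]


def one_hot(seq):
    """One 20-dimensional indicator vector per residue (preserves order)."""
    return [[1 if aa == residue else 0 for aa in AMINO_ACIDS] for residue in seq]


def hybrid(seq):
    """Illustrative hybrid input: global composition plus per-residue one-hot.

    Assumption: the paper's scheme uses a broader set of peptide properties;
    composition is used here only to keep the example self-contained.
    """
    return {"properties": composition(seq), "sequence": one_hot(seq)}


# Two hypothetical peptides that are permutations of each other.
a, b = "KLAKW", "WKALK"

print(composition(a) == composition(b))  # composition cannot tell them apart
print(one_hot(a) == one_hot(b))          # residue order makes these differ
```

A model that receives both parts of `hybrid(seq)` can, in principle, draw on global properties and residue order at once, which is the intuition behind the comparison reported above.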