Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), Ensenada, Mexico.
Cátedras CONAHCYT - Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), Ensenada, Mexico.
Protein Sci. 2024 Apr;33(4):e4928. doi: 10.1002/pro.4928.
Molecular features play an important role in many bio-chem-informatics tasks, such as Quantitative Structure-Activity Relationship (QSAR) modeling. Several pre-trained models have recently been created for use in downstream tasks, either by fine-tuning a specific model or by extracting features to feed traditional classifiers. In this regard, a new family of Evolutionary Scale Modeling models (termed ESM-2 models) was recently introduced, demonstrating outstanding results in protein structure prediction benchmarks. Herein, we studied the usefulness of embeddings of different dimensions derived from the ESM-2 models for classifying antimicrobial peptides (AMPs). To this end, we built a KNIME workflow that applies the same modeling methodology across experiments in order to guarantee fair analyses. As a result, the 640- and 1280-dimensional embeddings derived from the 30- and 33-layer ESM-2 models, respectively, proved the most valuable, since the QSAR models built from them achieved statistically better performance. We also fused features from the different ESM-2 models and concluded that fusion yields better QSAR models than using the features of a single ESM-2 model. Frequency studies revealed that only a portion of the ESM-2 embeddings is valuable for modeling tasks, since between 43% and 66% of the features were never used. Comparisons with state-of-the-art deep learning (DL) models confirm that, when methodologically principled studies are performed in the prediction of AMPs, non-DL-based QSAR models yield comparable-to-superior performance relative to DL-based QSAR models. The developed KNIME workflow is freely available at https://github.com/cicese-biocom/classification-QSAR-bioKom. This workflow can be valuable for avoiding unfair comparisons of new computational methods, as well as for proposing new non-DL-based QSAR models.
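As a rough illustration of two ideas in the abstract, feature fusion and the frequency study of unused features, the following is a minimal sketch only. It assumes that fusion means plain concatenation of per-model embedding vectors and that each QSAR model records the indices of the features it selected; neither detail is stated in the abstract. The function names and the toy vectors are hypothetical, and real ESM-2 embeddings would have hundreds of dimensions (e.g., 640 or 1280) rather than the handful used here.

```python
from collections import Counter


def fuse_embeddings(*per_model_vectors):
    """Concatenate one peptide's embeddings from several models (assumed fusion scheme)."""
    fused = []
    for vec in per_model_vectors:
        fused.extend(vec)
    return fused


def never_used_fraction(n_features, selected_subsets):
    """Return the fraction of feature indices that appear in none of the selected subsets."""
    counts = Counter(idx for subset in selected_subsets for idx in subset)
    unused = sum(1 for i in range(n_features) if counts[i] == 0)
    return unused / n_features


# Toy stand-ins for embeddings of one peptide from two different models.
emb_a = [0.1, 0.2, 0.3]
emb_b = [0.4, 0.5]
fused = fuse_embeddings(emb_a, emb_b)

# Hypothetical feature subsets chosen by three separate models (indices into `fused`).
subsets = [[0, 3], [0, 4], [3]]
fraction_unused = never_used_fraction(len(fused), subsets)
print(f"{fraction_unused:.0%} of fused features were never selected")
```

In this toy case, features 1 and 2 are never selected, so two of the five fused features go unused; the 43% to 66% range reported above would be the analogous quantity computed over the actual ESM-2 embeddings.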