Enhancing predictions of protein stability changes induced by single mutations using MSA-based Language Models.

Authors

Cuturello Francesca, Celoria Marco, Ansuini Alessio, Cazzaniga Alberto

Affiliations

AREA Science Park, Trieste, 34149, Italy.

CINECA National Supercomputing Center, Bologna, 40033, Italy.

Publication information

Bioinformatics. 2024 Jul 16;40(7). doi: 10.1093/bioinformatics/btae447.

Abstract

MOTIVATION

Protein Language Models offer a new perspective for addressing challenges in structural biology, while relying solely on sequence information. Recent studies have investigated their effectiveness in forecasting shifts in thermodynamic stability caused by single amino acid mutations, a task known for its complexity due to the sparse availability of data, constrained by experimental limitations. To tackle this problem, we introduce two key novelties: leveraging a Protein Language Model that incorporates Multiple Sequence Alignments to capture evolutionary information, and using a recently released mega-scale dataset with rigorous data pre-processing to mitigate overfitting.

RESULTS

We ensure comprehensive comparisons by fine-tuning various pre-trained models, supported by analyses such as ablation studies and baseline evaluations. Our methodology introduces a stringent policy to reduce the widespread issue of data leakage, rigorously removing sequences from the training set when they exhibit significant similarity with the test set. The MSA Transformer emerges as the most accurate among the models under investigation, given its capability to leverage co-evolution signals encoded in aligned homologous sequences. Moreover, the optimized MSA Transformer outperforms existing methods and exhibits enhanced generalization power, leading to a notable improvement in predicting changes in protein stability resulting from point mutations.
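
The leakage policy described above (dropping training sequences that are too similar to any test sequence) can be illustrated with a minimal Python sketch. The abstract does not state which similarity measure or cutoff the authors use, so everything here is an assumption: `similarity`, `filter_training_set`, the `threshold` value, and the toy sequences are hypothetical, and `difflib.SequenceMatcher` is only a stand-in for the alignment-based sequence identity that a real pipeline would more likely compute.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Rough similarity in [0, 1].

    Assumption: a stand-in for alignment-based sequence identity,
    not the measure used in the paper.
    """
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def filter_training_set(train, test, threshold=0.25):
    """Keep training sequences whose similarity to every test
    sequence is below `threshold` (a hypothetical cutoff)."""
    return [
        seq for seq in train
        if all(similarity(seq, t) < threshold for t in test)
    ]


# Toy sequences, invented for illustration only.
train = ["MKTAYIAKQR", "MKTAYIAKQL", "GSHMLEDPVW"]
test = ["MKTAYIAKQR"]

kept = filter_training_set(train, test, threshold=0.5)
print(kept)
```

In this toy run the first two training sequences are identical or nearly identical to the test sequence and are discarded, while the unrelated third sequence is kept. The pairwise comparison is quadratic in the number of sequences, so a dataset of the size mentioned in the abstract would probably need a dedicated clustering or search tool rather than this loop.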

AVAILABILITY AND IMPLEMENTATION

Code and data at https://github.com/RitAreaSciencePark/PLM4Muts.

SUPPLEMENTARY INFORMATION

Supplementary Information is available at Bioinformatics online.
