Protein stability prediction by fine-tuning a protein language model on a mega-scale dataset.

Authors

Chu Simon K S, Narang Kush, Siegel Justin B

Affiliations

Biophysics Graduate Program, University of California Davis, Davis, California, United States of America.

College of Biological Sciences, University of California Davis, Davis, California, United States of America.

Publication information

PLoS Comput Biol. 2024 Jul 22;20(7):e1012248. doi: 10.1371/journal.pcbi.1012248. eCollection 2024 Jul.

DOI:10.1371/journal.pcbi.1012248

PMID:39038042

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11293664/

Abstract

Protein stability plays a crucial role in a variety of applications, such as food processing, therapeutics, and the identification of pathogenic mutations. Engineering campaigns commonly seek to improve protein stability, and there is a strong interest in streamlining these processes to enable rapid optimization of highly stabilized proteins with fewer iterations. In this work, we explore utilizing a mega-scale dataset to develop a protein language model optimized for stability prediction. ESMtherm is trained on the folding stability of 528k natural and de novo sequences derived from 461 protein domains and can accommodate deletions, insertions, and multiple-point mutations. We show that a protein language model can be fine-tuned to predict folding stability. ESMtherm performs reasonably on small protein domains and generalizes to sequences distal from the training set. Lastly, we discuss our model's limitations compared to other state-of-the-art methods in generalizing to larger protein scaffolds. Our results highlight the need for large-scale stability measurements on a diverse dataset that mirrors the distribution of sequence lengths commonly observed in nature.
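
The abstract notes that a sequence-based model can accept deletions, insertions, and multiple-point mutations. One common way to get that property, sketched here as an illustration only, is to pool per-residue features into a fixed-size vector and pass it to a regression head that outputs one scalar. Everything in this sketch is an assumption made for illustration: the one-hot features stand in for language-model embeddings, and the weights, bias, and example sequences are hypothetical. It is not the ESMtherm architecture or its trained parameters.

```python
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def embed_residue(aa: str) -> List[float]:
    """Toy per-residue feature: a one-hot vector (stand-in for a language-model embedding)."""
    if aa not in AMINO_ACIDS:
        raise ValueError(f"unknown residue: {aa!r}")
    return [1.0 if aa == a else 0.0 for a in AMINO_ACIDS]


def mean_pool(sequence: str) -> List[float]:
    """Average the per-residue vectors so any sequence length yields a 20-dimensional vector."""
    if not sequence:
        raise ValueError("empty sequence")
    vectors = [embed_residue(aa) for aa in sequence]
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def predict_stability(sequence: str, weights: List[float], bias: float) -> float:
    """Linear regression head on the pooled vector; returns a single scalar score."""
    pooled = mean_pool(sequence)
    return sum(w * x for w, x in zip(weights, pooled)) + bias


# Hypothetical parameters, chosen arbitrarily; a real model would learn these from data.
weights = [0.1 * i for i in range(len(AMINO_ACIDS))]
bias = -0.5

# Hypothetical sequences showing that length changes need no special handling.
wild_type = "MKTAYIAKQR"
variants: Dict[str, str] = {
    "wild type": wild_type,
    "double mutant": "MKTAYLAKQE",
    "deletion": wild_type[:4] + wild_type[5:],
    "insertion": wild_type[:4] + "G" + wild_type[4:],
}

scores = {name: predict_stability(seq, weights, bias) for name, seq in variants.items()}
for name, score in scores.items():
    print(f"{name:>13} (length {len(variants[name]):2d}): score = {score:.3f}")
```

Because pooling removes the dependence on length, the same head scores the 9-residue deletion variant and the 11-residue insertion variant without alignment to the wild type. Mean pooling also discards residue order, which a real language-model embedding would retain through its context-dependent per-residue representations.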

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/af8f/11293664/8d5c7b099923/pcbi.1012248.g001.jpg
