Protein stability prediction by fine-tuning a protein language model on a mega-scale dataset.

Authors

Chu Simon K S, Narang Kush, Siegel Justin B

Affiliations

Biophysics Graduate Program, University of California Davis, Davis, California, United States of America.

College of Biological Sciences, University of California Davis, Davis, California, United States of America.

Publication information

PLoS Comput Biol. 2024 Jul 22;20(7):e1012248. doi: 10.1371/journal.pcbi.1012248. eCollection 2024 Jul.

Abstract

Protein stability plays a crucial role in a variety of applications, such as food processing, therapeutics, and the identification of pathogenic mutations. Engineering campaigns commonly seek to improve protein stability, and there is a strong interest in streamlining these processes to enable rapid optimization of highly stabilized proteins with fewer iterations. In this work, we explore utilizing a mega-scale dataset to develop a protein language model optimized for stability prediction. ESMtherm is trained on the folding stability of 528k natural and de novo sequences derived from 461 protein domains and can accommodate deletions, insertions, and multiple-point mutations. We show that a protein language model can be fine-tuned to predict folding stability. ESMtherm performs reasonably on small protein domains and generalizes to sequences distal from the training set. Lastly, we discuss our model's limitations compared to other state-of-the-art methods in generalizing to larger protein scaffolds. Our results highlight the need for large-scale stability measurements on a diverse dataset that mirrors the distribution of sequence lengths commonly observed in nature.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/af8f/11293664/8d5c7b099923/pcbi.1012248.g001.jpg
