Protein stability prediction by fine-tuning a protein language model on a mega-scale dataset.

Author information

Chu Simon K S, Narang Kush, Siegel Justin B

Affiliations

Biophysics Graduate Program, University of California Davis, Davis, California, United States of America.

College of Biological Sciences, University of California Davis, Davis, California, United States of America.

Publication information

PLoS Comput Biol. 2024 Jul 22;20(7):e1012248. doi: 10.1371/journal.pcbi.1012248. eCollection 2024 Jul.

Abstract

Protein stability plays a crucial role in a variety of applications, such as food processing, therapeutics, and the identification of pathogenic mutations. Engineering campaigns commonly seek to improve protein stability, and there is a strong interest in streamlining these processes to enable rapid optimization of highly stabilized proteins with fewer iterations. In this work, we explore utilizing a mega-scale dataset to develop a protein language model optimized for stability prediction. ESMtherm is trained on the folding stability of 528k natural and de novo sequences derived from 461 protein domains and can accommodate deletions, insertions, and multiple-point mutations. We show that a protein language model can be fine-tuned to predict folding stability. ESMtherm performs reasonably on small protein domains and generalizes to sequences distal from the training set. Lastly, we discuss our model's limitations compared to other state-of-the-art methods in generalizing to larger protein scaffolds. Our results highlight the need for large-scale stability measurements on a diverse dataset that mirrors the distribution of sequence lengths commonly observed in nature.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/af8f/11293664/8d5c7b099923/pcbi.1012248.g001.jpg
