Chu Simon K S, Narang Kush, Siegel Justin B
Biophysics Graduate Program, University of California Davis, Davis, California, United States of America.
College of Biological Sciences, University of California Davis, Davis, California, United States of America.
PLoS Comput Biol. 2024 Jul 22;20(7):e1012248. doi: 10.1371/journal.pcbi.1012248. eCollection 2024 Jul.
Protein stability plays a crucial role in a variety of applications, such as food processing, therapeutics, and the identification of pathogenic mutations. Engineering campaigns commonly seek to improve protein stability, and there is a strong interest in streamlining these processes to enable rapid optimization of highly stabilized proteins with fewer iterations. In this work, we explore utilizing a mega-scale dataset to develop a protein language model optimized for stability prediction. ESMtherm is trained on the folding stability of 528k natural and de novo sequences derived from 461 protein domains and can accommodate deletions, insertions, and multiple-point mutations. We show that a protein language model can be fine-tuned to predict folding stability. ESMtherm performs reasonably on small protein domains and generalizes to sequences distal from the training set. Lastly, we discuss our model's limitations compared to other state-of-the-art methods in generalizing to larger protein scaffolds. Our results highlight the need for large-scale stability measurements on a diverse dataset that mirrors the distribution of sequence lengths commonly observed in nature.
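As a rough illustration of why a sequence-only model can accommodate deletions, insertions, and multiple-point mutations, the sketch below builds variants of different lengths and pads them into one batch with an attention mask. This is not the ESMtherm code or tokenizer: the wild-type sequence, the helper functions (`substitute`, `insert`, `delete`, `encode_batch`), and the token vocabulary are all hypothetical and chosen only for the example.

```python
# Hypothetical illustration only; not the ESMtherm implementation or tokenizer.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD_ID = 0  # assumed padding id; real tokenizers define their own
TOKEN_IDS = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def substitute(seq, pos, new_aa):
    """Replace the residue at 0-based position pos."""
    return seq[:pos] + new_aa + seq[pos + 1:]


def insert(seq, pos, new_aa):
    """Insert a residue before 0-based position pos."""
    return seq[:pos] + new_aa + seq[pos:]


def delete(seq, pos):
    """Remove the residue at 0-based position pos."""
    return seq[:pos] + seq[pos + 1:]


def encode_batch(seqs):
    """Map residues to integer ids and right-pad to the longest sequence."""
    max_len = max(len(s) for s in seqs)
    ids, mask = [], []
    for s in seqs:
        row = [TOKEN_IDS[aa] for aa in s]
        pad = max_len - len(row)
        ids.append(row + [PAD_ID] * pad)
        mask.append([1] * len(row) + [0] * pad)  # 1 = real residue, 0 = padding
    return ids, mask


wild_type = "MKTAYIAK"  # made-up sequence, not taken from the dataset
variants = [
    wild_type,
    substitute(wild_type, 2, "A"),                      # single-point mutation
    substitute(substitute(wild_type, 0, "L"), 5, "V"),  # multiple-point mutation
    insert(wild_type, 4, "G"),                          # insertion
    delete(wild_type, 3),                               # deletion
]
token_ids, attention_mask = encode_batch(variants)
```

Because every variant is just a string of residues, length changes need no structural alignment to the wild type. A regression head placed on top of a language model's per-residue embeddings could then use the mask to ignore padded positions.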