Gelman Sam, Johnson Bryce, Freschlin Chase R, Sharma Arnav, D'Costa Sameer, Peters John, Gitter Anthony, Romero Philip A
Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI, USA.
Morgridge Institute for Research, Madison, WI, USA.
Nat Methods. 2025 Sep 11. doi: 10.1038/s41592-025-02776-2.
Protein language models trained on evolutionary data have emerged as powerful tools for predictive problems involving protein sequence, structure and function. However, these models overlook decades of research into biophysical factors governing protein function. We propose mutational effect transfer learning (METL), a protein language model framework that unites advanced machine learning and biophysical modeling. Using the METL framework, we pretrain transformer-based neural networks on biophysical simulation data to capture fundamental relationships between protein sequence, structure and energetics. We fine-tune METL on experimental sequence-function data to harness these biophysical signals and apply them when predicting protein properties like thermostability, catalytic activity and fluorescence. METL excels in challenging protein engineering tasks like generalizing from small training sets and position extrapolation, although existing methods that train on evolutionary signals remain powerful for many types of experimental assays. We demonstrate METL's ability to design functional green fluorescent protein variants when trained on only 64 examples, showcasing the potential of biophysics-based protein language models for protein engineering.