Inverse folding of protein complexes with a structure-informed language model enables unsupervised antibody evolution.

Authors

Shanker Varun R, Bruun Theodora U J, Hie Brian L, Kim Peter S

Affiliations

Stanford Biophysics Program, Stanford University School of Medicine, Stanford, CA 94305, USA.

Stanford Medical Scientist Training Program, Stanford University School of Medicine, Stanford, CA 94305, USA.

Publication information

bioRxiv. 2023 Dec 21:2023.12.19.572475. doi: 10.1101/2023.12.19.572475.

DOI:10.1101/2023.12.19.572475

PMID:38187780

Original link:https://pmc.ncbi.nlm.nih.gov/articles/PMC10769282/

Abstract

Large language models trained on sequence information alone are capable of learning high-level principles of protein design. However, beyond sequence, the three-dimensional structures of proteins determine their specific function, activity, and evolvability. Here we show that a general protein language model augmented with protein structure backbone coordinates and trained on the inverse folding problem can guide evolution for diverse proteins without needing to explicitly model individual functional tasks. We demonstrate inverse folding to be an effective unsupervised, structure-based sequence optimization strategy that also generalizes to multimeric complexes by implicitly learning features of binding and amino acid epistasis. Using this approach, we screened ~30 variants of two therapeutic clinical antibodies used to treat SARS-CoV-2 infection and achieved up to 26-fold improvement in neutralization and 37-fold improvement in affinity against antibody-escaped viral variants-of-concern BQ.1.1 and XBB.1.5, respectively. In addition to substantial overall improvements in protein function, we find inverse folding performs with leading experimental success rates among other reported machine learning-guided directed evolution methods, without requiring any task-specific training data.
