Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA.
Program in Molecular Biophysics, The Johns Hopkins University, Baltimore, MD, USA.
Cell Syst. 2023 Nov 15;14(11):979-989.e4. doi: 10.1016/j.cels.2023.10.001. Epub 2023 Oct 30.
Discovery and optimization of monoclonal antibodies for therapeutic applications rely on large sequence libraries but are hindered by developability issues such as low solubility, high aggregation, and high immunogenicity. Generative language models, trained on millions of protein sequences, are a powerful tool for on-demand generation of realistic, diverse sequences. We present the Immunoglobulin Language Model (IgLM), a deep generative language model for creating synthetic antibody libraries. Whereas prior methods leverage only unidirectional context for sequence generation, IgLM formulates antibody design as a text-infilling task, analogous to infilling in natural language, allowing it to re-design variable-length spans within antibody sequences using bidirectional context. We trained IgLM on 558 million (558M) antibody heavy- and light-chain variable sequences, conditioning on each sequence's chain type and species of origin. We demonstrate that IgLM can generate full-length antibody sequences from a variety of species, and that its infilling formulation allows it to generate infilled complementarity-determining region (CDR) loop libraries with improved in silico developability profiles. A record of this paper's transparent peer review process is included in the supplemental information.
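The infilling formulation described above can be sketched as follows: a masked span is moved to the end of the sequence so that a left-to-right language model sees both the prefix and the suffix before generating the span. This is a minimal illustrative sketch; the function name and the special tokens ([HUMAN], [HEAVY], [MASK], [SEP], [ANS]) are assumptions for exposition, not the model's actual vocabulary.

```python
# Hypothetical sketch of an IgLM-style infilling training example.
# Token names and the helper function are illustrative assumptions.

def format_infilling_example(sequence: str, start: int, end: int,
                             chain: str = "[HEAVY]",
                             species: str = "[HUMAN]") -> str:
    """Arrange an antibody sequence for span infilling: the span to
    re-design is replaced by a mask token and appended after a separator,
    so an autoregressive model conditions on bidirectional context
    (prefix and suffix) plus chain-type and species tags."""
    span = sequence[start:end]
    masked = sequence[:start] + "[MASK]" + sequence[end:]
    # Conditioning tags precede the masked sequence; the model is trained
    # to generate the masked span after the [SEP] token.
    return f"{species}{chain}{masked}[SEP]{span}[ANS]"
```

At generation time, the same layout would be used with the span left empty after [SEP], and the model samples variable-length infills (e.g., a CDR loop) conditioned on the surrounding framework regions.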