IgLM: Infilling language modeling for antibody sequence design.

Affiliations

Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA.

Program in Molecular Biophysics, The Johns Hopkins University, Baltimore, MD, USA.

Publication information

Cell Syst. 2023 Nov 15;14(11):979-989.e4. doi: 10.1016/j.cels.2023.10.001. Epub 2023 Oct 30.

DOI:10.1016/j.cels.2023.10.001

PMID:37909045

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11018345/

Abstract

Discovery and optimization of monoclonal antibodies for therapeutic applications relies on large sequence libraries but is hindered by developability issues such as low solubility, high aggregation, and high immunogenicity. Generative language models, trained on millions of protein sequences, are a powerful tool for the on-demand generation of realistic, diverse sequences. We present the Immunoglobulin Language Model (IgLM), a deep generative language model for creating synthetic antibody libraries. Compared with prior methods that leverage unidirectional context for sequence generation, IgLM formulates antibody design based on text-infilling in natural language, allowing it to re-design variable-length spans within antibody sequences using bidirectional context. We trained IgLM on 558 million (M) antibody heavy- and light-chain variable sequences, conditioning on each sequence's chain type and species of origin. We demonstrate that IgLM can generate full-length antibody sequences from a variety of species and its infilling formulation allows it to generate infilled complementarity-determining region (CDR) loop libraries with improved in silico developability profiles. A record of this paper's transparent peer review process is included in the supplemental information.
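The infilling formulation mentioned in the abstract can be illustrated with a minimal sketch: a span is cut out of the sequence, replaced by a mask token, and appended after the remaining sequence, so that a left-to-right model has already read both flanks when it generates the span. The token names (`[MASK]`, `[SEP]`, `[ANS]`), the conditioning tags (`[HEAVY]`, `[HUMAN]`), the function names, and the toy sequence below are all assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import random

# Hypothetical special tokens; the abstract does not name them.
MASK, SEP, ANS = "[MASK]", "[SEP]", "[ANS]"


def make_infilling_example(sequence, start, end, chain_tag, species_tag):
    """Rearrange a sequence so the span [start, end) comes last.

    Layout: chain tag, species tag, prefix, MASK, suffix, SEP, span, ANS.
    A model reading left to right sees both flanks before the span.
    """
    if not 0 <= start < end <= len(sequence):
        raise ValueError("span must be a non-empty slice of the sequence")
    prefix, span, suffix = sequence[:start], sequence[start:end], sequence[end:]
    tokens = [chain_tag, species_tag]
    tokens += list(prefix) + [MASK] + list(suffix) + [SEP]
    tokens += list(span) + [ANS]
    return tokens


def random_span(length, rng, min_len=1, max_len=10):
    """Pick a variable-length span (start, end) inside a sequence."""
    span_len = rng.randint(min_len, min(max_len, length))
    start = rng.randint(0, length - span_len)
    return start, start + span_len


def restore(tokens):
    """Invert make_infilling_example: put the span back at the mask."""
    body = tokens[2:]  # drop the two conditioning tags
    mask_at, sep_at, ans_at = body.index(MASK), body.index(SEP), body.index(ANS)
    prefix = body[:mask_at]
    suffix = body[mask_at + 1:sep_at]
    span = body[sep_at + 1:ans_at]
    return "".join(prefix + span + suffix)


# Toy heavy-chain fragment (illustrative only, not a full variable domain).
toy = "EVQLVESGGG"
example = make_infilling_example(toy, 3, 6, "[HEAVY]", "[HUMAN]")
formatted = " ".join(example)
# formatted == "[HEAVY] [HUMAN] E V Q [MASK] S G G G [SEP] L V E [ANS]"
```

At generation time, a model trained on such examples would be given everything up to `[SEP]` and sampled until it emits `[ANS]`; `restore` then splices the sampled span back into the masked position. Sampling several spans for the same masked context is what would yield a library of variable-length redesigns for one region.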


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4b3a/11018345/9eafcddff324/nihms-1938138-f0001.jpg



