Almubarak Hannah Faisal, Tan Wuwei, Hoffmann Andrew D, Sun Yuanfei, Wei Juncheng, El-Shennawy Lamiaa, Squires Joshua R, Dashzeveg Nurmaa K, Simonton Brooke, Jia Yuzhi, Iyer Radhika, Xu Yanan, Nicolaescu Vlad, Elli Derek, Randall Glenn C, Schipma Matthew J, Swaminathan Suchitra, Ison Michael G, Liu Huiping, Fang Deyu, Shen Yang
Department of Pharmacology, Northwestern University Feinberg School of Medicine, Chicago, IL, USA 60611.
Driskill Graduate Program, Northwestern University Feinberg School of Medicine, Chicago, IL, USA 60611.
bioRxiv. 2024 Aug 20:2024.03.01.582176. doi: 10.1101/2024.03.01.582176.
Therapeutic antibodies have become one of the most influential classes of therapeutics in modern medicine for fighting infectious pathogens, cancer, and many other diseases. However, experimental screening for highly efficacious targeting antibodies is labor-intensive and costly, a problem exacerbated by antigen targets that evolve under selective pressure, such as fast-mutating viral variants. As a proof of concept, we developed AbGen, a machine learning-assisted antibody generation pipeline that greatly accelerates the screening and redesign of immunoglobulin G (IgG) antibodies against a broad spectrum of SARS-CoV-2 variant strains. AbGen centers on a novel antibody language model (AbLM) that is pretrained on 12 million generic protein domain sequences and fine-tuned on 4,000+ paired VH-VL sequences, with IgG-specific CDR masking and VH-VL cross-attention. AbLM provides AbGen with a latent space of IgG sequence embeddings, over which (a) landscapes of IgG activity in neutralizing the wild-type virus are analyzed through structure prediction of IgGs and of IgG-antigen interactions (the antigen being the receptor-binding domain, RBD, of the viral spike protein); and (b) landscapes of IgG susceptibility to variant viruses are predicted through Gaussian process regression, even though responses to variants of concern are available for as few as 14 clinical antibodies. The AbGen pipeline was applied to over 1,300 IgG sequences that we collected from RBD-binding B cells of convalescent patients. With experimental validation, AbGen efficiently prioritized IgG candidates against a broad spectrum of viral variants (wild-type, Delta, and Omicron) that prevented infection of host cells and hACE2 transgenic mice. Compared with existing protein language models that require 10-100 times more model parameters, AbLM improved the precision of predicting IgGs with low variant susceptibility from around 50% to 75%.
Furthermore, AbGen enables structure-based computational protein redesign of selected IgG clones: single amino acid substitutions at the RBD-binding interface doubled the IgG blockade efficacy against Delta (B.1.617), one of the severe, therapy-resistant strains. Our work expedites applications of artificial intelligence in antibody screening and redesign by combining data-driven protein language models and Kriging for antibody sequence analysis and activity prediction, in synergy with physics-driven protein docking and design for antibody-antigen interface analysis and functional optimization.
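To illustrate the Gaussian process regression (Kriging) step mentioned in the abstract, the following is a minimal sketch of how a susceptibility value could be predicted for a new antibody from a handful of labeled embeddings. It is not the AbGen implementation: the squared-exponential kernel, the zero-mean prior, the function names (`rbf_kernel`, `gp_predict`), the hyperparameters, and the toy 2-D "embeddings" and susceptibility values are all assumptions made for illustration, and real AbLM embeddings would be far higher-dimensional.

```python
import math

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two embedding vectors (assumed kernel)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * sq_dist / length_scale ** 2)

def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower

def solve_lower(lower, b):
    """Forward substitution: solve L y = b."""
    y = [0.0] * len(b)
    for i in range(len(b)):
        y[i] = (b[i] - sum(lower[i][k] * y[k] for k in range(i))) / lower[i][i]
    return y

def solve_upper(lower, y):
    """Back substitution: solve L^T x = y using the lower factor L."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(lower[k][i] * x[k] for k in range(i + 1, n))) / lower[i][i]
    return x

def gp_predict(train_x, train_y, query_x, length_scale=1.0, noise=1e-6):
    """Posterior mean and variance of a zero-mean GP at each query point."""
    n = len(train_x)
    # Kernel matrix of the labeled points, with observation noise on the diagonal.
    gram = [[rbf_kernel(train_x[i], train_x[j], length_scale)
             + (noise if i == j else 0.0) for j in range(n)] for i in range(n)]
    lower = cholesky(gram)
    alpha = solve_upper(lower, solve_lower(lower, train_y))  # (K + noise*I)^-1 y
    results = []
    for q in query_x:
        k_star = [rbf_kernel(q, x, length_scale) for x in train_x]
        mean = sum(k * a for k, a in zip(k_star, alpha))
        v = solve_lower(lower, k_star)
        var = rbf_kernel(q, q, length_scale) - sum(t * t for t in v)
        results.append((mean, max(var, 0.0)))  # clip tiny negative rounding error
    return results

# Toy data (made up): 2-D embeddings of labeled antibodies and their
# hypothetical susceptibility scores.
train_x = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
train_y = [0.1, 0.4, 0.3, 0.9]

for mean, var in gp_predict(train_x, train_y, [[0.5, 0.5], [10.0, 10.0]]):
    print(f"predicted susceptibility: {mean:.3f} (variance {var:.3f})")
```

A practical appeal of this approach when labels are scarce is that the posterior variance grows for candidates far from any labeled antibody in embedding space, so low-confidence predictions can be flagged rather than trusted. In the sketch, the query at `[10.0, 10.0]` falls back to the prior (mean near 0, variance near 1), whereas queries near labeled points are pulled toward the observed values.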