van Ginneken Daphne, Samant Anamay, Daga-Krumins Karlis, Glänzer Wiona, Agrafiotis Andreas, Kladis Evgenios, Reddy Sai T, Yermanos Alexander
Center for Translational Immunology, University Medical Center Utrecht, Lundlaan 6, Utrecht 3584EA, The Netherlands.
Department of Biosystems Science and Engineering, ETH Zurich, Klingelbergstrasse 48, 4056 Basel, Switzerland.
Brief Bioinform. 2025 Jul 2;26(4). doi: 10.1093/bib/bbaf418.
B cell selection and evolution play crucial roles in dictating successful immune responses. Recent advancements in sequencing technologies and deep-learning strategies have paved the way for generating and exploiting an ever-growing wealth of antibody repertoire data. Self-supervised protein language models (PLMs) have demonstrated the ability to learn complex representations of antibody sequences and have been leveraged for a wide range of applications including diagnostics, structural modeling, and antigen-specificity predictions. PLM-derived likelihoods have been used to improve antibody affinities in vitro, raising the question of whether PLMs can capture and predict features of B cell selection in vivo. Here, we explore how general and antibody-specific PLM-generated sequence pseudolikelihoods (SPs) relate to features of in vivo B cell selection such as expansion, isotype usage, and somatic hypermutation (SHM) at single-cell resolution. Our results demonstrate that the type of PLM and the region of the antibody input sequence significantly affect the generated SP. Contrary to previous in vitro reports, we observe a negative correlation between SPs and binding affinity, whereas repertoire features such as SHM and isotype usage are strongly correlated with SPs. By constructing evolutionary lineage trees of B cell clones from human and mouse repertoires, we observe that SHMs are routinely among the most likely mutations suggested by PLMs and that mutating residues have lower absolute likelihoods than conserved residues. Our findings highlight the potential of PLMs to predict features of antibody selection and further suggest their potential to assist in antibody discovery and engineering.
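For readers unfamiliar with the term, a sequence pseudolikelihood can be sketched as follows: each residue is masked in turn, the model scores the true residue given the rest of the sequence, and the log-probabilities are combined. The snippet below is a minimal illustration only, not the authors' implementation. The names `toy_masked_probs` and `sequence_pseudolikelihood` are hypothetical, the toy scoring function merely stands in for a real masked PLM, and averaging over sequence length is an assumption made here for comparability across sequences.

```python
import math
from typing import Callable, Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_masked_probs(seq: str, pos: int) -> Dict[str, float]:
    """Hypothetical stand-in for a masked PLM.

    Returns a distribution over the 20 standard amino acids for position
    `pos`, using add-one-smoothed residue counts from the rest of the
    sequence. A real PLM would replace this function.
    """
    counts = {aa: 1.0 for aa in AMINO_ACIDS}
    for i, aa in enumerate(seq):
        if i != pos:  # the masked position contributes nothing
            counts[aa] += 1.0
    total = sum(counts.values())
    return {aa: c / total for aa, c in counts.items()}


def sequence_pseudolikelihood(
    seq: str,
    masked_probs: Callable[[str, int], Dict[str, float]] = toy_masked_probs,
) -> float:
    """Mean log-probability of each true residue when it is masked in turn.

    Assumes `seq` is non-empty and contains only the 20 standard amino
    acids. Length normalization is an assumption of this sketch.
    """
    log_probs = [
        math.log(masked_probs(seq, i)[aa]) for i, aa in enumerate(seq)
    ]
    return sum(log_probs) / len(log_probs)


if __name__ == "__main__":
    # Illustrative CDR3-like fragment; the value has no biological meaning
    # under the toy model.
    print(sequence_pseudolikelihood("CARDYW"))
```

Passing a different `masked_probs` callable, for example a wrapper around an actual masked language model, would leave the rest of the calculation unchanged.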