Ng Karenna, Briney Bryan
Department of Immunology and Microbiology, The Scripps Research Institute, La Jolla, CA 92037, USA.
Center for Viral Systems Biology, The Scripps Research Institute, La Jolla, CA 92037, USA.
Patterns (N Y). 2025 Apr 25;6(6):101239. doi: 10.1016/j.patter.2025.101239. eCollection 2025 Jun 13.
Existing antibody language models (AbLMs) are pre-trained using a masked language modeling (MLM) objective with uniform masking probabilities. While these models excel at predicting germline residues, they often struggle with mutated and non-templated residues, which concentrate in the complementarity-determining regions (CDRs) and are crucial for antigen binding specificity. Here, we demonstrate that preferential masking of the primarily non-templated CDR3 is a compute-efficient strategy to enhance model performance. We pre-trained two AbLMs using either uniform or preferential masking and observed that the latter improves residue prediction accuracy in the highly variable CDR3. Preferential masking also improves antibody classification by native chain pairing and binding specificity, suggesting improved CDR3 understanding and indicating that non-random, learnable patterns help govern antibody chain pairing. We further show that specificity classification is largely informed by residues in the CDRs, demonstrating that AbLMs learn meaningful patterns that align with immunological understanding.
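The contrast between uniform and preferential masking can be sketched in a few lines of Python. This is a minimal illustration, not the authors' training code: the masking rates (`p_base`, `p_cdr3`), the `<mask>` token string, and the function names are assumptions chosen for the example, and the CDR3 span is assumed to be supplied by an external annotation step. Uniform masking is the special case where `p_cdr3` equals `p_base`.

```python
import random

MASK = "<mask>"  # hypothetical mask token; real tokenizers define their own


def masking_probs(length, cdr3, p_base=0.15, p_cdr3=0.25):
    """Per-position masking probabilities.

    cdr3 is a half-open (start, end) index span. The rates are
    illustrative assumptions, not values taken from the paper.
    """
    start, end = cdr3
    return [p_cdr3 if start <= i < end else p_base for i in range(length)]


def mask_sequence(residues, probs, rng):
    """Mask each residue independently with its own probability.

    Returns (masked, labels); labels hold the original residue at
    masked positions and None elsewhere, as an MLM loss would need.
    """
    masked, labels = [], []
    for res, p in zip(residues, probs):
        if rng.random() < p:
            masked.append(MASK)
            labels.append(res)
        else:
            masked.append(res)
            labels.append(None)
    return masked, labels


if __name__ == "__main__":
    # Toy sequence fragment with a made-up CDR3 span, for illustration only.
    seq = list("AVYYCARDYGSSYWYFDVWGQGTLVTVSS")
    probs = masking_probs(len(seq), cdr3=(5, 17))
    masked, labels = mask_sequence(seq, probs, random.Random(0))
    print(" ".join(masked))
```

Because each position is sampled independently, raising `p_cdr3` above `p_base` increases the share of training signal that comes from CDR3 residues without changing the model or the MLM objective itself.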