Wang Xin, Chen Haotian, Chen Bo, Liang Lixin, Mei Fengcheng, Huang Bingding
College of Big Data and Internet, Shenzhen Technology University, Shenzhen, China.
Chengdu NBbiolab Co., Ltd., SME Incubation Park, Chengdu, China.
Front Chem. 2025 Feb 25;13:1545136. doi: 10.3389/fchem.2025.1545136. eCollection 2025.
Traditional methods for constructing synthetic nanobody libraries are labor-intensive and time-consuming. This study introduces a novel approach leveraging protein large language models (LLMs) to generate germline-specific nanobody sequences, enabling efficient library construction through statistical analysis.
We developed NanoAbLLaMA, a protein LLM based on LLaMA2 and fine-tuned with low-rank adaptation (LoRA) on 120,000 curated nanobody sequences. The model generates sequences conditioned on germline (IGHV3-3*01 or IGHV3S53*01). Training involved dataset preparation from SAbDab and experimental data, alignment with IMGT germline references, and structural validation using ImmuneBuilder and Foldseek.
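The core of the training recipe above is LoRA: the pretrained weights stay frozen while two small low-rank matrices are trained and added to them. A minimal NumPy sketch of that idea follows; the hidden size, rank, and scaling here are hypothetical illustrations, not the actual NanoAbLLaMA hyperparameters, which would apply this update to LLaMA2 attention weights via a library such as PEFT.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, rank = 64, 8  # hypothetical hidden size and LoRA rank
W = rng.standard_normal((d_model, d_model))  # frozen pretrained weight

# LoRA trains only A (d x r) and B (r x d); the effective weight
# becomes W + (alpha / r) * A @ B, so far fewer parameters update.
A = rng.standard_normal((d_model, rank)) * 0.01
B = np.zeros((rank, d_model))  # B starts at zero, so the delta starts at zero
alpha = 16.0

def lora_forward(x):
    """Forward pass with the low-rank update added to the frozen weight."""
    return x @ (W + (alpha / rank) * A @ B)

x = rng.standard_normal((1, d_model))
# With B = 0 the adapted model reproduces the frozen model exactly.
assert np.allclose(lora_forward(x), x @ W)

# Trainable parameters: 2 * d * r = 1,024 instead of d * d = 4,096.
trainable = A.size + B.size
print(trainable, W.size)
```

The parameter saving scales with rank: at LLaMA2 sizes (d in the thousands, r of 8-64), the trainable fraction drops to well under 1% of the full weight matrix, which is what makes fine-tuning on 120,000 sequences tractable.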
NanoAbLLaMA achieved near-perfect germline generation accuracy (100% for IGHV3-3*01, 95.5% for IGHV3S53*01). Structural evaluations demonstrated superior predicted Local Distance Difference Test (pLDDT) scores (90.32 ± 10.13) compared to IgLM (87.36 ± 11.17), with comparable TM-scores. Generated sequences exhibited fewer high-risk post-translational modification sites than IgLM. Statistical analysis of CDR regions confirmed diversity, particularly in CDR3, enabling the creation of synthetic libraries with high humanization (>99.9%) and low risk.
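Germline-accuracy and humanization figures like those above reduce, at their simplest, to a residue-identity score between a generated sequence and a germline framework reference. The sketch below shows that calculation on made-up fragments; the sequences are illustrative placeholders, not real IGHV references, and the paper's actual assignment uses IMGT-aligned germlines rather than this ungapped comparison.

```python
def percent_identity(seq: str, ref: str) -> float:
    """Percent of matching residues over the shorter ungapped length."""
    n = min(len(seq), len(ref))
    matches = sum(a == b for a, b in zip(seq[:n], ref[:n]))
    return 100.0 * matches / n

# Hypothetical framework fragment and a generated sequence with one
# substitution (A -> T near the CDR1 boundary).
germline_fw = "EVQLVESGGGLVQPGGSLRLSCAAS"
generated   = "EVQLVESGGGLVQPGGSLRLSCTAS"

print(round(percent_identity(generated, germline_fw), 1))  # 96.0
```

A real pipeline would first number the sequences (e.g. IMGT numbering) so that framework and CDR positions align before scoring, and would report identity over framework positions only when assessing humanization.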
This work establishes a paradigm shift in nanobody library construction by integrating LLMs, significantly reducing time and resource demands. While NanoAbLLaMA excels in germline-specific generation, limitations include restricted germline coverage and framework flexibility. Future efforts should expand germline diversity and incorporate druggability metrics for clinical relevance. The model's code, data, and resources are publicly available to facilitate broader adoption.