Wang Xin, Chen Haotian, Chen Bo, Liang Lixin, Mei Fengcheng, Huang Bingding
College of Big Data and Internet, Shenzhen Technology University, Shenzhen, China.
Chengdu NBbiolab Co., Ltd., SME Incubation Park, Chengdu, China.
Front Chem. 2025 Feb 25;13:1545136. doi: 10.3389/fchem.2025.1545136. eCollection 2025.
Traditional methods for constructing synthetic nanobody libraries are labor-intensive and time-consuming. This study introduces a novel approach leveraging protein large language models (LLMs) to generate germline-specific nanobody sequences, enabling efficient library construction through statistical analysis.
We developed NanoAbLLaMA, a protein LLM based on LLaMA2 and fine-tuned with low-rank adaptation (LoRA) on 120,000 curated nanobody sequences. The model generates sequences conditioned on germline (IGHV3-3*01 or IGHV3S53*01). Training involved dataset preparation from SAbDab and experimental data, alignment with IMGT germline references, and structural validation using ImmuneBuilder and Foldseek.
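The core of the training recipe above is LoRA: the pretrained weights stay frozen while two small low-rank matrices are trained and added to them. A minimal NumPy sketch of that idea follows; the hidden size, rank, and scaling here are hypothetical illustrations, not the actual NanoAbLLaMA hyperparameters, which would apply this update to LLaMA2 attention weights via a library such as PEFT.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, rank = 64, 8  # hypothetical hidden size and LoRA rank
W = rng.standard_normal((d_model, d_model))  # frozen pretrained weight

# LoRA trains only A (d x r) and B (r x d); the effective weight
# becomes W + (alpha / r) * A @ B, so far fewer parameters update.
A = rng.standard_normal((d_model, rank)) * 0.01
B = np.zeros((rank, d_model))  # B starts at zero, so the delta starts at zero
alpha = 16.0

def lora_forward(x):
    """Forward pass with the low-rank update added to the frozen weight."""
    return x @ (W + (alpha / rank) * A @ B)

x = rng.standard_normal((1, d_model))
# With B = 0 the adapted model reproduces the frozen model exactly.
assert np.allclose(lora_forward(x), x @ W)

# Trainable parameters: 2 * d * r = 1,024 instead of d * d = 4,096.
trainable = A.size + B.size
print(trainable, W.size)
```

The parameter saving scales with rank: at LLaMA2 sizes (d in the thousands, r of 8-64), the trainable fraction drops to well under 1% of the full weight matrix, which is what makes fine-tuning on 120,000 sequences tractable.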
NanoAbLLaMA achieved near-perfect germline generation accuracy (100% for IGHV3-3*01, 95.5% for IGHV3S53*01). Structural evaluations demonstrated superior predicted Local Distance Difference Test (pLDDT) scores (90.32 ± 10.13) compared to IgLM (87.36 ± 11.17), with comparable TM-scores. Generated sequences exhibited fewer high-risk post-translational modification sites than IgLM. Statistical analysis of CDR regions confirmed diversity, particularly in CDR3, enabling the creation of synthetic libraries with high humanization (>99.9%) and low risk.
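Germline-accuracy and humanization figures like those above reduce, at their simplest, to a residue-identity score between a generated sequence and a germline framework reference. The sketch below shows that calculation on made-up fragments; the sequences are illustrative placeholders, not real IGHV references, and the paper's actual assignment uses IMGT-aligned germlines rather than this ungapped comparison.

```python
def percent_identity(seq: str, ref: str) -> float:
    """Percent of matching residues over the shorter ungapped length."""
    n = min(len(seq), len(ref))
    matches = sum(a == b for a, b in zip(seq[:n], ref[:n]))
    return 100.0 * matches / n

# Hypothetical framework fragment and a generated sequence with one
# substitution (A -> T near the CDR1 boundary).
germline_fw = "EVQLVESGGGLVQPGGSLRLSCAAS"
generated   = "EVQLVESGGGLVQPGGSLRLSCTAS"

print(round(percent_identity(generated, germline_fw), 1))  # 96.0
```

A real pipeline would first number the sequences (e.g. IMGT numbering) so that framework and CDR positions align before scoring, and would report identity over framework positions only when assessing humanization.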
This work establishes a paradigm shift in nanobody library construction by integrating LLMs, significantly reducing time and resource demands. While NanoAbLLaMA excels in germline-specific generation, limitations include restricted germline coverage and framework flexibility. Future efforts should expand germline diversity and incorporate druggability metrics for clinical relevance. The model's code, data, and resources are publicly available to facilitate broader adoption.