AbLang: an antibody language model for completing antibody sequences.

Authors

Olsen Tobias H, Moal Iain H, Deane Charlotte M

Affiliations

Department of Statistics, University of Oxford, Oxford OX1 3LB, UK.

GSK Medicines Research Centre, GlaxoSmithKline, Stevenage SG1 2NY, UK.

Publication information

Bioinform Adv. 2022 Jun 17;2(1):vbac046. doi: 10.1093/bioadv/vbac046. eCollection 2022.

DOI:10.1093/bioadv/vbac046
PMID:36699403
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9710568/
Abstract

MOTIVATION

General protein language models have been shown to summarize the semantics of protein sequences into representations that are useful for state-of-the-art predictive methods. However, for antibody specific problems, such as restoring residues lost due to sequencing errors, a model trained solely on antibodies may be more powerful. Antibodies are one of the few protein types where the volume of sequence data needed for such language models is available, e.g. in the Observed Antibody Space (OAS) database.

RESULTS

Here, we introduce AbLang, a language model trained on the antibody sequences in the OAS database. We demonstrate the power of AbLang by using it to restore missing residues in antibody sequence data, a key issue with B-cell receptor repertoire sequencing, e.g. over 40% of OAS sequences are missing the first 15 amino acids. AbLang restores the missing residues of antibody sequences better than using IMGT germlines or the general protein language model ESM-1b. Further, AbLang does not require knowledge of the germline of the antibody and is seven times faster than ESM-1b.
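
To make the comparison above concrete, the following is a minimal sketch of what a germline-copy baseline for restoring a truncated N-terminus could look like. It is not the authors' procedure: the function name `restore_n_terminus`, the germline names, and the short germline strings are all hypothetical and not taken from IMGT, and the sketch assumes sequences are position-aligned with no insertions or deletions.

```python
# Toy germline-copy baseline for restoring a truncated N-terminus.
# Assumptions (not from the source): sequences are position-aligned with no
# insertions or deletions, and the germline strings are illustrative only.

def restore_n_terminus(truncated, n_missing, germlines):
    """Return (germline_name, restored_sequence).

    Picks the germline that best matches the observed residues, then copies
    its first `n_missing` residues onto the front of the truncated sequence.
    """
    def matches(germline_seq):
        # Residues of the germline that line up with the observed fragment.
        aligned = germline_seq[n_missing:n_missing + len(truncated)]
        return sum(a == b for a, b in zip(aligned, truncated))

    best_name = max(germlines, key=lambda name: matches(germlines[name]))
    return best_name, germlines[best_name][:n_missing] + truncated


# Hypothetical germline fragments, used only to exercise the function.
toy_germlines = {
    "toy_V1": "EVQLVESGGGLVQPGGSLRLSCAAS",
    "toy_V2": "QVQLVQSGAEVKKPGASVKVSCKAS",
}

# An observed read that lost its first 15 residues and carries one mutation
# relative to toy_V2.
observed = "SSVKVSCKAS"
name, restored = restore_n_terminus(observed, 15, toy_germlines)
print(name, restored)
```

A baseline of this kind has to select a germline before it can copy anything, and it can only ever return germline residues. This is the dependency the abstract says AbLang avoids.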

AVAILABILITY AND IMPLEMENTATION

AbLang is a python package available at https://github.com/oxpig/AbLang.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics Advances online.
