Protein sequence modelling with Bayesian flow networks.

Author information

Atkinson Timothy, Barrett Thomas D, Cameron Scott, Guloglu Bora, Greenig Matthew, Tan Charlie B, Robinson Louis, Graves Alex, Copoiu Liviu, Laterre Alexandre

Affiliation

InstaDeep, 5 Merchant Square, London, W2 1AY, England.

Publication information

Nat Commun. 2025 Apr 3;16(1):3197. doi: 10.1038/s41467-025-58250-2.

DOI:10.1038/s41467-025-58250-2
PMID:40180946
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11968962/
Abstract

Exploring the vast and largely uncharted territory of amino acid sequences is crucial for understanding complex protein functions and the engineering of novel therapeutic proteins. Whilst generative machine learning has advanced protein sequence modelling, no existing approach is proficient in both unconditional and conditional generation. In this work, we propose that Bayesian Flow Networks (BFNs), a recently introduced framework for generative modelling, can address these challenges. We present ProtBFN, a 650M parameter model trained on protein sequences curated from UniProtKB, which generates natural-like, diverse, structurally coherent, and novel protein sequences, significantly outperforming leading autoregressive and discrete diffusion models. Further, we fine-tune ProtBFN on heavy chains from the Observed Antibody Space to obtain an antibody-specific model, AbBFN, which we use to evaluate zero-shot conditional generation capabilities. AbBFN is found to be competitive with or better than antibody-specific BERT-style models when applied to predicting individual framework or complementarity-determining regions.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c39e/11968962/9c009f8e50a4/41467_2025_58250_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c39e/11968962/4e7f13e95000/41467_2025_58250_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c39e/11968962/561dfe9b3bcf/41467_2025_58250_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c39e/11968962/8d6c42f1360d/41467_2025_58250_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c39e/11968962/7b5980086c47/41467_2025_58250_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c39e/11968962/087a7ecbea04/41467_2025_58250_Fig6_HTML.jpg


