Efficient generative modeling of protein sequences using simple autoregressive models.

Affiliations

Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative LCQB, F-75005, Paris, France.

Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005, Paris, France.

Publication information

Nat Commun. 2021 Oct 4;12(1):5800. doi: 10.1038/s41467-021-25756-4.

DOI:10.1038/s41467-021-25756-4
PMID:34608136
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8490405/
Abstract

Generative models emerge as promising candidates for novel sequence-data driven approaches to protein design, and for the extraction of structural and functional information about proteins deeply hidden in rapidly growing sequence databases. Here we propose simple autoregressive models as highly accurate but computationally efficient generative sequence models. We show that they perform similarly to existing approaches based on Boltzmann machines or deep generative models, but at a substantially lower computational cost (by a factor between 10^2 and 10^3). Furthermore, the simple structure of our models has distinctive mathematical advantages, which translate into an improved applicability in sequence generation and evaluation. Within these models, we can easily estimate both the probability of a given sequence, and, using the model's entropy, the size of the functional sequence space related to a specific protein family. In the example of response regulators, we find a huge number of ca. 10^68 possible sequences, which nevertheless constitute only the astronomically small fraction 10^-80 of all amino-acid sequences of the same length. These findings illustrate the potential and the difficulty in exploring sequence space via generative sequence models.
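
To make the abstract's two estimates concrete, the following is a minimal sketch of a toy autoregressive sequence model, not the authors' implementation. It assumes a softmax conditional built from site fields and couplings to earlier positions, with random parameters and a tiny alphabet; the names `h`, `J`, `conditional`, `log_prob`, and `sample` are hypothetical. It shows how the chain rule gives an exact sequence probability, and how the model entropy S yields an effective number of sequences exp(S) and its fraction of all q^L sequences.

```python
import numpy as np

rng = np.random.default_rng(0)
q, L = 4, 6  # toy alphabet size and sequence length (hypothetical values)

# Random toy parameters (assumption); a real model would fit these to an alignment.
h = rng.normal(size=(L, q))              # site fields h_i(a)
J = 0.5 * rng.normal(size=(L, L, q, q))  # couplings J_ij(a, b), used only for j < i

def conditional(i, prefix):
    """P(a_i = . | a_1..a_{i-1}) as a softmax over the q states."""
    logits = h[i].copy()
    for j, b in enumerate(prefix):
        logits += J[i, j, :, b]
    logits -= logits.max()  # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def log_prob(seq):
    """Exact log-probability via the chain rule: sum_i log P(a_i | a_<i)."""
    return sum(np.log(conditional(i, seq[:i])[a]) for i, a in enumerate(seq))

def sample():
    """Draw one sequence position by position; no Markov chain is needed."""
    seq = []
    for i in range(L):
        seq.append(int(rng.choice(q, p=conditional(i, seq))))
    return seq

# Monte Carlo entropy estimate (in nats): S = -E[log P(sequence)].
samples = [sample() for _ in range(2000)]
entropy = -np.mean([log_prob(s) for s in samples])

# Effective number of sequences exp(S), and its share of all q**L sequences.
log10_size = entropy / np.log(10)
log10_fraction = log10_size - L * np.log10(q)
print(f"effective sequences ~ 10^{log10_size:.2f}, fraction ~ 10^{log10_fraction:.2f}")
```

Because every conditional is normalized, the sequence probabilities sum to one without computing a partition function, which is the property that makes both sampling and probability evaluation cheap in this kind of model.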

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ac1a/8490405/8c64c30da8f2/41467_2021_25756_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ac1a/8490405/82a8647d824f/41467_2021_25756_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ac1a/8490405/00f084b9086a/41467_2021_25756_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ac1a/8490405/1579218d3a97/41467_2021_25756_Fig3_HTML.jpg

Similar articles

1
Efficient generative modeling of protein sequences using simple autoregressive models.
Nat Commun. 2021 Oct 4;12(1):5800. doi: 10.1038/s41467-021-25756-4.
2
The generative capacity of probabilistic protein sequence models.
Nat Commun. 2021 Nov 2;12(1):6302. doi: 10.1038/s41467-021-26529-9.
3
ProtWave-VAE: Integrating Autoregressive Sampling with Latent-Based Inference for Data-Driven Protein Design.
ACS Synth Biol. 2023 Dec 15;12(12):3544-3561. doi: 10.1021/acssynbio.3c00261. Epub 2023 Nov 21.
4
Towards parsimonious generative modeling of RNA families.
Nucleic Acids Res. 2024 Jun 10;52(10):5465-5477. doi: 10.1093/nar/gkae289.
5
Support vector training of protein alignment models.
J Comput Biol. 2008 Sep;15(7):867-80. doi: 10.1089/cmb.2007.0152.
6
The intrinsic dimension of protein sequence evolution.
PLoS Comput Biol. 2019 Apr 8;15(4):e1006767. doi: 10.1371/journal.pcbi.1006767. eCollection 2019 Apr.
7
Euclidian space and grouping of biological objects.
Bioinformatics. 2002 Nov;18(11):1523-34. doi: 10.1093/bioinformatics/18.11.1523.
8
Learning generative models for protein fold families.
Proteins. 2011 Apr;79(4):1061-78. doi: 10.1002/prot.22934. Epub 2011 Jan 25.
9
Generative power of a protein language model trained on multiple sequence alignments.
Elife. 2023 Feb 3;12:e79854. doi: 10.7554/eLife.79854.
10
Size and structure of the sequence space of repeat proteins.
PLoS Comput Biol. 2019 Aug 15;15(8):e1007282. doi: 10.1371/journal.pcbi.1007282. eCollection 2019 Aug.

Cited by

1
Design of highly functional genome editors by modelling CRISPR-Cas sequences.
Nature. 2025 Jul 30. doi: 10.1038/s41586-025-09298-z.
2
Explainability of Protein Deep Learning Models.
Int J Mol Sci. 2025 May 29;26(11):5255. doi: 10.3390/ijms26115255.
3
Generative Artificial Intelligence for Virology.

References

1
Protein sequence design by conformational landscape optimization.
Proc Natl Acad Sci U S A. 2021 Mar 16;118(11). doi: 10.1073/pnas.2017228118.
2
Generating functional protein variants with variational autoencoders.
PLoS Comput Biol. 2021 Feb 26;17(2):e1008736. doi: 10.1371/journal.pcbi.1008736. eCollection 2021 Feb.
3
Remote homology search with hidden Potts models.
Methods Mol Biol. 2025;2927:195-220. doi: 10.1007/978-1-0716-4546-8_11.
4
Phylogenetic Corrections and Higher-Order Sequence Statistics in Protein Families: The Potts Model vs MSA Transformer.
ArXiv. 2025 Mar 1:arXiv:2503.00289v1.
5
Harnessing advanced computational approaches to design novel antimicrobial peptides against intracellular bacterial infections.
Bioact Mater. 2025 Apr 28;50:510-524. doi: 10.1016/j.bioactmat.2025.04.016. eCollection 2025 Aug.
6
PRESCOTT: a population aware, epistatic, and structural model accurately predicts missense effects.
Genome Biol. 2025 May 6;26(1):113. doi: 10.1186/s13059-025-03581-y.
7
Reconstruction of Ancestral Protein Sequences Using Autoregressive Generative Models.
Mol Biol Evol. 2025 Apr 1;42(4). doi: 10.1093/molbev/msaf070.
8
Direct coupling analysis and the attention mechanism.
BMC Bioinformatics. 2025 Feb 6;26(1):41. doi: 10.1186/s12859-025-06062-y.
9
Exploring Evolution to Uncover Insights Into Protein Mutational Stability.
Mol Biol Evol. 2025 Jan 6;42(1). doi: 10.1093/molbev/msae267.
10
ProteinReDiff: Complex-based ligand-binding proteins redesign by equivariant diffusion-based generative models.
Struct Dyn. 2024 Nov 25;11(6):064102. doi: 10.1063/4.0000271. eCollection 2024 Nov.
PLoS Comput Biol. 2020 Nov 30;16(11):e1008085. doi: 10.1371/journal.pcbi.1008085. eCollection 2020 Nov.
4
Fast and Flexible Protein Design Using Deep Graph Neural Networks.
Cell Syst. 2020 Oct 21;11(4):402-411.e4. doi: 10.1016/j.cels.2020.08.016. Epub 2020 Sep 23.
5
An evolution-based model for designing chorismate mutase enzymes.
Science. 2020 Jul 24;369(6502):440-445. doi: 10.1126/science.aba3304.
6
Epistatic contributions promote the unification of incompatible models of neutral molecular evolution.
Proc Natl Acad Sci U S A. 2020 Mar 17;117(11):5873-5882. doi: 10.1073/pnas.1913071117. Epub 2020 Mar 2.
7
Deep Autoregressive Models for the Efficient Variational Simulation of Many-Body Quantum Systems.
Phys Rev Lett. 2020 Jan 17;124(2):020503. doi: 10.1103/PhysRevLett.124.020503.
8
Improved protein structure prediction using potentials from deep learning.
Nature. 2020 Jan;577(7792):706-710. doi: 10.1038/s41586-019-1923-7. Epub 2020 Jan 15.
9
Improved protein structure prediction using predicted interresidue orientations.
Proc Natl Acad Sci U S A. 2020 Jan 21;117(3):1496-1503. doi: 10.1073/pnas.1914677117. Epub 2020 Jan 2.
10
Selection of sequence motifs and generative Hopfield-Potts models for protein families.
Phys Rev E. 2019 Sep;100(3-1):032128. doi: 10.1103/PhysRevE.100.032128.