How Many Protein Sequences Fold to a Given Structure? A Coevolutionary Analysis.

Authors

Tian Pengfei, Best Robert B

Affiliation

Laboratory of Chemical Physics, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland.

Publication information

Biophys J. 2017 Oct 17;113(8):1719-1730. doi: 10.1016/j.bpj.2017.08.039.

DOI:10.1016/j.bpj.2017.08.039

PMID:29045866

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5647607/

Abstract

Quantifying the relationship between protein sequence and structure is key to understanding the protein universe. A fundamental measure of this relationship is the total number of amino acid sequences that can fold to a target protein structure, known as the "sequence capacity," which has been suggested as a proxy for how designable a given protein fold is. Although sequence capacity has been extensively studied using lattice models and theory, numerical estimates for real protein structures are currently lacking. In this work, we have quantitatively estimated the sequence capacity of 10 proteins with a variety of different structures using a statistical model based on residue-residue co-evolution to capture the variation of sequences from the same protein family. Remarkably, we find that even for the smallest protein folds, such as the WW domain, the number of foldable sequences is extremely large, exceeding the Avogadro constant. In agreement with earlier theoretical work, the calculated sequence capacity is positively correlated with the size of the protein, or better, the density of contacts. This allows the absolute sequence capacity of a given protein to be approximately predicted from its structure. On the other hand, the relative sequence capacity, i.e., normalized by the total number of possible sequences, is an extremely tiny number and is strongly anti-correlated with the protein length. Thus, although there may be more foldable sequences for larger proteins, it will be much harder to find them. Lastly, we have correlated the evolutionary age of proteins in the CATH database with their sequence capacity as predicted by our model. The results suggest a trade-off between the opposing requirements of high designability and the likelihood of a novel fold emerging by chance.
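
The contrast the abstract draws between absolute and relative sequence capacity comes down to simple arithmetic in log space: the relative capacity is the number of foldable sequences divided by the 20^L possible sequences of length L. The sketch below illustrates only that normalization; it is not the authors' coevolutionary model. The chain length of 35 is an assumed, roughly WW-domain-sized value, and the capacity is set to the Avogadro constant, which the abstract gives only as a lower bound, so the result is a lower bound as well. The function names are hypothetical.

```python
import math

AVOGADRO = 6.02214076e23  # Avogadro constant (exact SI value), per mole
N_AMINO_ACIDS = 20        # size of the standard amino acid alphabet


def log10_sequence_space(length: int) -> float:
    """Return log10 of the number of possible sequences of this length (20^L)."""
    return length * math.log10(N_AMINO_ACIDS)


def log10_relative_capacity(log10_capacity: float, length: int) -> float:
    """Return log10 of (foldable sequences / all possible sequences)."""
    return log10_capacity - log10_sequence_space(length)


# Illustrative inputs only (assumptions, not values reported in the paper):
# a 35-residue chain and a capacity equal to the Avogadro constant.
length = 35
log10_capacity = math.log10(AVOGADRO)

print(f"log10(total sequences)    = {log10_sequence_space(length):.1f}")
print(f"log10(relative capacity) >= {log10_relative_capacity(log10_capacity, length):.1f}")
```

Working in logarithms avoids overflow for longer chains, where 20^L quickly exceeds the floating-point range. Because the total grows exponentially with L, the relative capacity can shrink with length even when the absolute capacity rises, which is the trend the abstract describes.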
