Tian Pengfei, Best Robert B
Laboratory of Chemical Physics, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland.
Biophys J. 2017 Oct 17;113(8):1719-1730. doi: 10.1016/j.bpj.2017.08.039.
Quantifying the relationship between protein sequence and structure is key to understanding the protein universe. A fundamental measure of this relationship is the total number of amino acid sequences that can fold to a target protein structure, known as the "sequence capacity," which has been suggested as a proxy for how designable a given protein fold is. Although sequence capacity has been extensively studied using lattice models and theory, numerical estimates for real protein structures are currently lacking. In this work, we have quantitatively estimated the sequence capacity of 10 proteins with a variety of different structures using a statistical model based on residue-residue co-evolution to capture the variation of sequences from the same protein family. Remarkably, we find that even for the smallest protein folds, such as the WW domain, the number of foldable sequences is extremely large, exceeding the Avogadro constant. In agreement with earlier theoretical work, the calculated sequence capacity is positively correlated with the size of the protein or, better, with the density of contacts. This allows the absolute sequence capacity of a given protein to be approximately predicted from its structure. On the other hand, the relative sequence capacity, i.e., the sequence capacity normalized by the total number of possible sequences, is extremely small and is strongly anti-correlated with the protein length. Thus, although there may be more foldable sequences for larger proteins, it will be much harder to find them. Lastly, we have correlated the evolutionary age of proteins in the CATH database with their sequence capacity as predicted by our model. The results suggest a trade-off between the opposing requirements of high designability and the likelihood of a novel fold emerging by chance.
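The distinction between absolute and relative sequence capacity can be made concrete with a short calculation. The sketch below is illustrative only and does not reproduce the co-evolution model used in the paper. It assumes a 20-letter amino acid alphabet, so that a chain of length L has 20^L possible sequences, and it works in log10 to avoid very large numbers. The chain length of 35 residues for a WW-domain-sized protein is an assumed value, and the Avogadro constant is used only as the lower bound on absolute capacity mentioned in the abstract, not as the paper's estimate. The function names are hypothetical.

```python
import math

AVOGADRO = 6.02214076e23   # Avogadro constant (exact SI value), per mole
N_AMINO_ACIDS = 20         # assumption: standard 20-letter alphabet


def log10_total_sequences(length: int) -> float:
    """log10 of the number of possible sequences of a given length (20^L)."""
    return length * math.log10(N_AMINO_ACIDS)


def log10_relative_capacity(log10_capacity: float, length: int) -> float:
    """log10 of (absolute capacity / total number of possible sequences)."""
    return log10_capacity - log10_total_sequences(length)


if __name__ == "__main__":
    length = 35  # assumed length for a WW-domain-sized protein
    # Lower bound only: the abstract states the capacity exceeds Avogadro's constant.
    log10_cap = math.log10(AVOGADRO)
    print(f"log10(total sequences)    = {log10_total_sequences(length):.2f}")
    print(f"log10(absolute capacity) >= {log10_cap:.2f}")
    print(f"log10(relative capacity) >= {log10_relative_capacity(log10_cap, length):.2f}")
```

Under these assumptions the total sequence space is about 10^45.5, so an absolute capacity of order 10^24 is still only a fraction of order 10^-22 of all sequences. Because the total grows by a factor of 20 with each added residue, the relative capacity shrinks with length unless the absolute capacity grows equally fast.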