SISSA, Trieste, Italy.
DISAT, Politecnico di Torino, Torino, Italy.
PLoS Comput Biol. 2019 Apr 8;15(4):e1006767. doi: 10.1371/journal.pcbi.1006767. eCollection 2019 Apr.
It is well known that, to preserve its structure and function, a protein cannot change its sequence at random, but only through mutations that occur preferentially at specific locations. Here we quantify the variability allowed in protein sequence evolution by computing the intrinsic dimension (ID) of the sequences belonging to a selection of protein families. The ID measures the number of independent directions that evolution can take starting from a given sequence. We find that the ID is practically constant for sequences belonging to the same family, and that it is very similar across families, with values ranging between 6 and 12. These values are significantly smaller than the raw number of amino acids, confirming the importance of correlations between mutations at different sites. However, we demonstrate that correlations alone are not sufficient to explain the small ID we observe in protein families. Indeed, we show that the ID of a set of protein sequences generated by maximum entropy models, an approach that accounts for correlations, is typically significantly larger than the value observed in natural protein families. We further prove that a critical factor in reproducing the natural ID is taking the phylogeny of the sequences into account.
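To make the notion of intrinsic dimension concrete, here is a minimal sketch of one common way to estimate it from nearest-neighbour distances: a maximum-likelihood variant of the two-nearest-neighbour (TWO-NN) idea. The abstract does not say which estimator was used, so this should not be read as the authors' method. The function name `two_nn_id` and the toy data are illustrative assumptions. Sequences would also need a suitable distance (for example a Hamming distance), whose discreteness this continuous-space sketch does not handle.

```python
import math
import random


def two_nn_id(points):
    """Estimate intrinsic dimension from ratios of 2nd to 1st neighbour distances.

    Assumption: points are sampled with locally uniform density, so that
    mu = r2 / r1 follows a Pareto law with exponent equal to the ID, whose
    maximum-likelihood estimate is N / sum(log(mu)).
    """
    log_mu_sum = 0.0
    used = 0
    for i, p in enumerate(points):
        r1 = r2 = math.inf  # distances to the first and second nearest neighbours
        for j, q in enumerate(points):
            if i == j:
                continue
            d = math.dist(p, q)  # Euclidean distance (Python 3.8+)
            if d < r1:
                r1, r2 = d, r1
            elif d < r2:
                r2 = d
        if r1 > 0.0:  # skip exact duplicates, for which the ratio is undefined
            log_mu_sum += math.log(r2 / r1)
            used += 1
    return used / log_mu_sum


# Toy data (hypothetical): a 2-dimensional sheet embedded linearly in 5 dimensions.
random.seed(0)
points = []
for _ in range(500):
    u, v = random.random(), random.random()
    points.append((u, v, u + v, u - v, 2.0 * u))

estimate = two_nn_id(points)
print(f"estimated ID: {estimate:.2f} (embedding dimension: 5)")
```

Although each point has five coordinates, only two of them vary independently, so the estimate should come out close to 2 rather than 5. This mirrors the abstract's finding: the ID of a protein family is far smaller than the number of positions in the sequence.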