Department of Pathology, University of Alabama, Birmingham, AL 35249.
National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD 20894.
Proc Natl Acad Sci U S A. 2019 Feb 26;116(9):3636-3645. doi: 10.1073/pnas.1814684116. Epub 2019 Feb 7.
From an abstract, informational perspective, protein domains appear analogous to words in natural languages in which the rules of word association are dictated by linguistic rules, or grammar. Such rules exist for protein domains as well, because only a small fraction of all possible domain combinations is viable in evolution. We employ a popular linguistic technique, n-gram analysis, to probe the "proteome grammar", that is, the rules of association of domains that generate various domain architectures of proteins. Comparison of the complexity measures of "protein languages" in major branches of life shows that the relative entropy difference (information gain) between the observed domain architectures and random domain combinations is highly conserved in evolution and is close to being a universal constant, at ~1.2 bits. Substantial deviations from this constant are observed in only two major groups of organisms: a subset of Archaea that appears to be cells simplified to the limit, and animals that display extreme complexity. We also identify the n-grams that represent signatures of the major branches of cellular life. The results of this analysis bolster the analogy between genomes and natural language and show that a "quasi-universal grammar" underlies the evolution of domain architectures in all divisions of cellular life. The nearly universal value of information gain by the domain architectures could reflect the minimum complexity of signal processing that is required to maintain a functioning cell.
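To make the idea of an n-gram information gain concrete, the following is a minimal sketch for the 2-gram case. It is not the authors' procedure: it assumes that each protein is given as an ordered list of domain labels, and it takes the "random" model to be independent sampling of the left and right domain of each adjacent pair from their observed marginal frequencies. Under that assumption the information gain reduces to the Kullback-Leibler divergence, in bits, between the observed 2-gram distribution and the product of its marginals. The function names and the toy labels are hypothetical.

```python
import math
from collections import Counter


def bigrams(architecture):
    """Return the adjacent domain pairs (2-grams) of one architecture."""
    return list(zip(architecture, architecture[1:]))


def information_gain_bits(architectures):
    """KL divergence (bits) between observed 2-grams and a null model.

    Assumption (not from the source): the null model draws the left and
    right domain of a pair independently from their observed marginals.
    """
    pair_counts = Counter()
    for arch in architectures:
        pair_counts.update(bigrams(arch))

    total = sum(pair_counts.values())
    if total == 0:
        raise ValueError("no 2-grams: every architecture has < 2 domains")

    # Marginal counts of the left and right member of each pair.
    left_counts = Counter()
    right_counts = Counter()
    for (left, right), count in pair_counts.items():
        left_counts[left] += count
        right_counts[right] += count

    gain = 0.0
    for (left, right), count in pair_counts.items():
        p_observed = count / total
        p_random = (left_counts[left] / total) * (right_counts[right] / total)
        gain += p_observed * math.log2(p_observed / p_random)
    return gain


# Toy data: in `strict`, each left domain fixes its right partner;
# in `free`, every left domain pairs with every right domain equally often.
strict = [["A", "B"], ["A", "B"], ["C", "D"], ["C", "D"]]
free = [["A", "B"], ["A", "D"], ["C", "B"], ["C", "D"]]

print(information_gain_bits(strict))  # 1.0
print(information_gain_bits(free))    # 0.0
```

In this toy setting a fully constrained pairing yields 1 bit and an unconstrained one yields 0 bits. The ~1.2 bit figure reported in the abstract comes from the authors' own definition and data, which this sketch does not reproduce.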