Evolutionary Bioinformatics Laboratory, Department of Crop Sciences, University of Illinois, Urbana, IL 61801, USA.
Proc Natl Acad Sci U S A. 2011 Jul 19;108(29):11954-8. doi: 10.1073/pnas.1017361108. Epub 2011 Jul 5.
The lengths of orthologous protein families in Eukarya are almost double the lengths found in Bacteria and Archaea. Here we examine protein structures in 745 genomes and show that protein length differences between superkingdoms arise from much shorter prokaryotic nondomain linker sequences. Eukaryotic, bacterial, and archaeal linkers are 250, 86, and 73 aa residues in length, respectively, whereas folded domain sequences are 281, 280, and 256 residues, respectively. Cryptic domains match linkers (P < 0.0001) with probabilities ranging between 0.022 and 0.042; accordingly, they do not affect length estimates significantly. Linker sequences support intermolecular binding within proteomes and they are probably enriched in intrinsically disordered regions as well. Reductively evolved linker sequence lengths in growth-rate-maximized cells should be proportional to proteome diversity. By using the total in-frame coding capacity of a genome [i.e., coding sequence (CDS)] as a reliable measure of proteome diversity, we find that linker lengths of prokaryotes clearly evolve in proportion to CDS values, whereas those of eukaryotes are more randomly larger than expected. Domain lengths scarcely change over the entire range of CDS values. Thus, the protein linkers of prokaryotes evolve reductively whereas those of eukaryotes do not.
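The arithmetic behind the reported lengths can be sketched in a few lines of Python. This is only an illustration built from the six mean values quoted in the abstract. The names `LINKER`, `DOMAIN`, and `summarize` are hypothetical, and adding the mean linker length to the mean domain length to get a "total" is an assumption made here for illustration; it is not necessarily how the authors measured the length of orthologous families.

```python
# Mean lengths in amino acid residues, as quoted in the abstract.
LINKER = {"Eukarya": 250, "Bacteria": 86, "Archaea": 73}
DOMAIN = {"Eukarya": 281, "Bacteria": 280, "Archaea": 256}


def summarize(linker, domain):
    """Return per-superkingdom totals and linker fractions.

    Assumption (not from the source): total length is approximated as
    mean domain length plus mean linker length.
    """
    summary = {}
    for name in linker:
        total = linker[name] + domain[name]
        summary[name] = {
            "total": total,
            "linker_fraction": linker[name] / total,
        }
    return summary


summary = summarize(LINKER, DOMAIN)

for name, row in summary.items():
    print(f"{name}: total={row['total']} aa, "
          f"linker fraction={row['linker_fraction']:.2f}")

# Ratios relative to Eukarya: linkers differ by a factor of about three,
# whereas domain lengths are nearly the same across superkingdoms.
for name in ("Bacteria", "Archaea"):
    linker_ratio = LINKER["Eukarya"] / LINKER[name]
    domain_ratio = DOMAIN["Eukarya"] / DOMAIN[name]
    print(f"Eukarya/{name}: linker x{linker_ratio:.2f}, "
          f"domain x{domain_ratio:.2f}")
```

Under this simple sum, linkers account for roughly 47% of the eukaryotic total but only about 22 to 24% of the prokaryotic totals, which is consistent with the abstract's point that the length difference lies in the linkers rather than in the folded domains.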