Wolf Y I, Rogozin I B, Kondrashov A S, Koonin E V
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA.
Genome Res. 2001 Mar;11(3):356-72. doi: 10.1101/gr.gr-1619r.
Gene order in prokaryotes is conserved to a much lesser extent than protein sequences. Only a few operons, primarily those that code for physically interacting proteins, are conserved in all or most of the bacterial and archaeal genomes. Nevertheless, even the limited conservation of operon organization that is observed can provide valuable evolutionary and functional clues through multiple genome comparisons. A program for constructing gapped local alignments of conserved gene strings in two genomes was developed. The statistical significance of the local alignments was assessed using Monte Carlo simulations. Sets of local alignments were generated for all pairs of completely sequenced bacterial and archaeal genomes, and for each genome a template-anchored multiple alignment was constructed. In most pairwise genome comparisons, <10% of the genes in each genome belonged to conserved gene strings. When closely related pairs of species (i.e., two mycoplasmas) were excluded, the total coverage of genomes by conserved gene strings ranged from <5% for the cyanobacterium Synechocystis sp. to 24% for the minimal genome of Mycoplasma genitalium and 23% for Thermotoga maritima. The coverage of the archaeal genomes was only slightly lower than that of bacterial genomes. The majority of the conserved gene strings are known operons, with the ribosomal superoperon being the top-scoring string in most genome comparisons. However, in some of the bacterial-archaeal pairs, the superoperon is rearranged to the extent that other operons, primarily those subject to horizontal transfer, show the greatest level of conservation, such as the archaeal-type H+-ATPase operon or ABC-type transport cassettes. The level of gene order conservation among prokaryotic genomes was compared to the cooccurrence of genomes in clusters of orthologous genes (COGs) and to the conservation of protein sequences themselves. Only limited correlation was observed between these evolutionary variables.
Gene order conservation shows a much lower variance than the cooccurrence of genomes in COGs, which indicates that intragenome homogenization via recombination occurs in evolution much faster than intergenome homogenization via horizontal gene transfer and lineage-specific gene loss. The potential of using template-anchored multiple-genome alignments for predicting functions of uncharacterized genes was quantitatively assessed. Functions were predicted or significantly clarified for approximately 90 COGs (approximately 4% of the total of 2414 analyzed COGs). The most significant predictions were obtained for the poorly characterized archaeal genomes; these include a previously uncharacterized restriction-modification system, a nuclease-helicase combination implicated in DNA repair, and the probable archaeal counterpart of the eukaryotic exosome. Multiple genome alignments are a resource for studies on operon rearrangement and disruption, which is central to our understanding of the evolution of prokaryotic genomes. Because of the rapid evolution of the gene order, the potential of genome alignment for prediction of gene functions is limited, but nevertheless, such predictions significantly complement the results obtained through protein sequence and structure analysis.
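The abstract does not describe the alignment program or the simulation procedure in detail, so the following is only a minimal illustrative sketch, not the authors' implementation. It assumes that each genome is represented as an ordered list of ortholog labels, that a Smith-Waterman-style recurrence with a linear gap penalty stands in for the gapped local alignment, and that shuffling the gene order of one genome serves as the Monte Carlo null model. The function names, the scoring parameters (`match`, `mismatch`, `gap`), and the single-letter gene labels are all hypothetical, and gene orientation is ignored.

```python
import random


def local_align_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between two ordered gene lists.

    Smith-Waterman-style recurrence with a linear gap penalty; two genes
    "match" when they carry the same ortholog label. Scoring values are
    illustrative assumptions, not taken from the paper.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def monte_carlo_p_value(a, b, trials=200, seed=0):
    """Empirical p-value for the observed score under a shuffled-order null.

    The gene order of b is permuted at random (an assumed null model), and
    the p-value is the add-one estimate of how often a shuffled genome
    scores at least as high as the real one.
    """
    rng = random.Random(seed)
    observed = local_align_score(a, b)
    shuffled = list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if local_align_score(a, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (trials + 1)


# Hypothetical gene strings: genome_b carries one inserted gene ("X").
genome_a = list("ABCDE")
genome_b = list("ABXCDE")
score, p_value = monte_carlo_p_value(genome_a, genome_b)
```

In this toy case the conserved string spans the inserted gene at the cost of a single gap. The sketch returns only the score; recovering the aligned gene strings themselves would require keeping the full score matrix and tracing back from the best cell.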