Fournier Pierre-Edouard, Suhre Karsten, Fournous Ghislain, Raoult Didier
Information Génomique et Structurale, CNRS UPR2589, Case 934, 163 Avenue de Luminy, 13288 Marseille cedex 09, France.
Unité des rickettsies, IFR 48, CNRS UMR 6020, Faculté de Médecine, Université de la Méditerranée, 27 Boulevard Jean Moulin, 13385 Marseille cedex 05, France.
Int J Syst Evol Microbiol. 2006 May;56(Pt 5):1025-1029. doi: 10.1099/ijs.0.63903-0.
Determination of the DNA G+C content of prokaryotic genomes using traditional methods is time-consuming and results may vary from laboratory to laboratory, depending on the technique used. We explored the possibility of extrapolating the genomic DNA G+C content of prokaryotes from gene sequences. For this, 127 universally conserved genes were studied from 50 prokaryotic genomes in the Clusters of Orthologous Groups database. Of these, 57 genes were present as a single copy in the genomes of 157 different prokaryote species available in GenBank. There was a strong correlation [coefficient of determination (r²) >95%] between the DNA G+C contents of 20 genes and their corresponding genomes. For each of the 157 prokaryotic genomes studied, the DNA G+C content of the 20 genes was used to determine a 'calculated' genome DNA G+C content (CGC) and this value was compared with the 'real' genome DNA G+C content (RGC). In order to select the most suitable gene for the determination of CGC values, we compared the r² and median mol% difference between CGC and RGC as well as the sensitivity of each gene to provide CGC values for prokaryotic genomes that differ by less than 5 mol% from their RGC. The highly conserved ftsY gene (median size 1144 nucleotides), a vertically inherited member of the GTPase superfamily, showed the highest r² value of 0.98, the smallest median mol% difference between CGC and RGC of 1.06 and a sensitivity of 100%. Using ftsY DNA G+C content values, the CGC values of 100 genomes not included in the calculation of r² differed by less than 5 mol% from their RGC values. These data suggest that the genomic DNA G+C content of prokaryotes may be estimated easily and reliably from the ftsY gene sequence.
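The quantities named in the abstract (gene G+C content, r², and the sensitivity at a 5 mol% threshold) can be illustrated with a minimal Python sketch. The abstract does not state how CGC is derived from a gene's G+C content, so the ordinary least-squares line used here is an assumption, and the function names `gc_content`, `fit_line`, `predict_cgc` and `sensitivity` are hypothetical helpers introduced only for illustration.

```python
from statistics import mean


def gc_content(seq: str) -> float:
    """Return the G+C content in mol%, counting only unambiguous A/C/G/T bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return 100.0 * gc / (gc + at)


def fit_line(gene_gc, genome_gc):
    """Least-squares fit genome_gc = slope * gene_gc + intercept.

    Assumption: a simple linear model relates gene and genome G+C content.
    Returns (slope, intercept, r2), where r2 is the coefficient of determination.
    """
    mx, my = mean(gene_gc), mean(genome_gc)
    sxx = sum((x - mx) ** 2 for x in gene_gc)
    syy = sum((y - my) ** 2 for y in genome_gc)
    sxy = sum((x - mx) * (y - my) for x, y in zip(gene_gc, genome_gc))
    slope = sxy / sxx
    intercept = my - slope * mx
    r2 = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r2


def predict_cgc(gene_gc_value, slope, intercept):
    """Calculated genome G+C content (CGC) from one gene's G+C content."""
    return slope * gene_gc_value + intercept


def sensitivity(cgc, rgc, threshold=5.0):
    """Percentage of genomes whose CGC differs from RGC by less than threshold mol%."""
    hits = sum(abs(c - r) < threshold for c, r in zip(cgc, rgc))
    return 100.0 * hits / len(rgc)
```

In such a sketch, `fit_line` would be trained on gene and genome G+C values from a reference set of genomes, and `predict_cgc` together with `sensitivity` would then be applied to genomes held out of the fit, mirroring the separate set of 100 genomes described in the abstract.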