Leibniz Institute DSMZ - German Collection of Microorganisms and Cell Cultures GmbH, Inhoffenstraße 7B, 38124 Braunschweig, Germany.
Int J Syst Evol Microbiol. 2014 Feb;64(Pt 2):352-356. doi: 10.1099/ijs.0.056994-0.
The G+C content of a genome is frequently used in taxonomic descriptions of species and genera. In the past it was determined using conventional, indirect methods, but it is now reasonable to calculate the DNA G+C content directly from the increasingly available and affordable genome sequences. The expected increase in accuracy, however, might alter the way in which the G+C content is used for drawing taxonomic conclusions. Here we use genomic datasets to re-examine the assumption from the literature that the G+C content can vary by up to 3-5% within species. The resulting G+C content differences are compared with DNA-DNA hybridization (DDH) similarities calculated in silico using the GGDC web server, with 70% similarity as the gold-standard threshold for species boundaries. The results indicate that the G+C content, if computed from genome sequences, varies by no more than 1% within species. Statistical models based on larger differences alone can reject the hypothesis that two strains belong to the same species. Because DDH similarities between two non-type strains occur in the genomic datasets, we also examine to what extent and under which conditions such a similarity could be <70% even though the similarity of each strain to a type strain is ≥70%. In theory, their similarity could be as low as 50%, whereas empirical data suggest a boundary closer (but not identical) to 70%. However, using a 50% boundary is shown not to affect the conclusions regarding the DNA G+C content. Hence, we suggest that discrepancies of ≥1% between the G+C content given in a species description and the value recalculated after genome sequencing are due to significant inaccuracies of the conventional methods applied, and accordingly call for emendations of species descriptions.
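As an illustration of computing the G+C content directly from a genome sequence, the following is a minimal sketch and not the authors' pipeline. The function names `gc_content` and `exceeds_within_species_range` are hypothetical. Skipping ambiguous bases such as N and pooling all contigs of an assembly are assumptions made here. The 1% cutoff from the abstract is interpreted as a difference in percentage points.

```python
def gc_content(sequences):
    """Return the G+C content in percent, pooled over all given sequences.

    Assumption: only unambiguous bases (A, C, G, T) are counted;
    ambiguity codes such as N are ignored.
    """
    gc = 0
    at = 0
    for seq in sequences:
        s = seq.upper()
        gc += s.count("G") + s.count("C")
        at += s.count("A") + s.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("no unambiguous bases found")
    return 100.0 * gc / total


def exceeds_within_species_range(gc_a, gc_b, threshold=1.0):
    """Return True if two G+C values (in percent) differ by more than
    `threshold` percentage points (default 1.0, following the abstract)."""
    return abs(gc_a - gc_b) > threshold


if __name__ == "__main__":
    # Toy sequences for illustration only, not real genome data.
    strain_a = ["GGCCAATT", "GCGCATAT"]
    strain_b = ["GGGCCCAT", "GCGCGCAT"]
    a = gc_content(strain_a)
    b = gc_content(strain_b)
    print(a, b, exceeds_within_species_range(a, b))
```

A difference above the threshold would, by the abstract's argument, speak against the two genomes belonging to the same species. A difference below it does not by itself establish that they are conspecific.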