The University of Queensland, School of Chemistry and Molecular Biosciences, Australian Centre for Ecogenomics, St Lucia, QLD, Australia.
Present address: Department of Microbiology, University of Innsbruck, Innsbruck, Austria.
Microb Genom. 2024 May;10(5). doi: 10.1099/mgen.0.001256.
The Genome Taxonomy Database (GTDB), which has been widely adopted by the scientific community, provides a species-to-domain classification of publicly available genomes based on average nucleotide identity (ANI) for species and a concatenated gene phylogeny normalized by evolutionary rates for genus to phylum. Here, we use the Genome UNClutterer (GUNC) software to identify putatively contaminated genomes in GTDB release 07-RS207. GUNC reported 35,723 genomes as putatively contaminated, comprising 11.25% of the 317,542 genomes in the release. To assess the impact of this high level of inferred contamination on the delineation of taxa, we created 'clean' versions of the 34,846 putatively contaminated bacterial genomes by removing the most contaminated half of each. For each clean half, we recalculated the ANI and the concatenated gene phylogeny and found that only 77 (0.22%) of the genomes were inconsistent with their original classification. We conclude that the delineation of taxa in GTDB is robust to the putative contamination detected by GUNC.
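The proportions quoted in the abstract can be checked directly from the reported counts. The following is a minimal sketch, not part of the study's own pipeline; the constant and function names are illustrative assumptions, and only the four counts come from the abstract.

```python
# Counts as reported in the abstract for GTDB release 07-RS207.
TOTAL_GENOMES = 317_542            # all genomes in the release
FLAGGED_BY_GUNC = 35_723           # genomes GUNC reported as putatively contaminated
FLAGGED_BACTERIAL = 34_846         # flagged bacterial genomes that were "cleaned"
INCONSISTENT_AFTER_CLEANING = 77   # cleaned genomes not matching their original classification


def percent(part: int, whole: int) -> float:
    """Return `part` as a percentage of `whole`, rounded to two decimals."""
    return round(100 * part / whole, 2)


# Share of the release flagged by GUNC.
flagged_pct = percent(FLAGGED_BY_GUNC, TOTAL_GENOMES)

# Share of cleaned bacterial genomes whose classification changed.
inconsistent_pct = percent(INCONSISTENT_AFTER_CLEANING, FLAGGED_BACTERIAL)

# Flagged genomes that were not part of the cleaned bacterial set.
flagged_non_bacterial = FLAGGED_BY_GUNC - FLAGGED_BACTERIAL

print(f"Flagged by GUNC: {flagged_pct}% of {TOTAL_GENOMES:,} genomes")
print(f"Inconsistent after cleaning: {inconsistent_pct}% of {FLAGGED_BACTERIAL:,} genomes")
print(f"Flagged genomes outside the bacterial set: {flagged_non_bacterial:,}")
```

Both percentages match the values stated in the abstract. The difference between the two flagged counts is the number of flagged genomes outside the cleaned bacterial set, which the abstract does not describe further.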