Goussarov Gleb, Claesen Jürgen, Mysara Mohamed, Cleenwerck Ilse, Leys Natalie, Vandamme Peter, Van Houdt Rob
Microbiology Unit, Belgian Nuclear Research Centre (SCK CEN), Mol, Belgium.
Laboratory of Microbiology and BCCM/LMG Bacteria Collection, Faculty of Sciences, Ghent University, Ghent, Belgium.
Environ Microbiome. 2022 Mar 5;17(1):9. doi: 10.1186/s40793-022-00403-7.
Although the total number of microbial taxa on Earth is under debate, it is clear that only a small fraction of these has been cultivated and validly named. The inability to culture most bacteria outside of very specific conditions severely limits their characterization and further study. In the last decade, a major part of the solution to this problem has been metagenome sequencing, whereby the DNA of an entire microbial community is sequenced, followed by the in silico reconstruction of the genomes of its novel component species. The large discrepancy between the number of sequenced type strain genomes (around 12,000) and total microbial diversity (10^6-10^12 species) directs these efforts to de novo assembly and binning. Unfortunately, these steps are error-prone, so the results have to be intensely scrutinized to avoid publishing incomplete and low-quality genomes.
We developed MAGISTA (metagenome-assembled genome intra-bin statistics assessment), a novel approach to assessing the quality of metagenome-assembled genomes that tackles some of the often-neglected drawbacks of current reference gene-based methods. Rather than relying on a set of reference genes, MAGISTA is based on alignment-free distance distributions between contig fragments within metagenomic bins. Proper training required a highly complex genomic DNA mock community, which we constructed by pooling genomic DNA of 227 bacterial strains specifically selected to cover a wide variety of the major phylogenetic lineages of cultivable bacteria.
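The idea of an intra-bin, alignment-free distance distribution can be illustrated with a minimal sketch. This is not the MAGISTA implementation: the fragment size, the use of tetranucleotide frequencies, the Euclidean distance, and all function names below are assumptions chosen for illustration only. The sketch cuts each contig of a bin into fixed-size fragments, computes a k-mer frequency profile per fragment, and summarizes the pairwise distances between all fragments in the bin.

```python
import math
from itertools import combinations, product


def fragment(seq, size):
    """Cut a contig into non-overlapping fragments of `size` bases (short tail dropped)."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]


def kmer_profile(seq, k=4):
    """Normalized k-mer frequency vector over the ACGT alphabet (assumed feature)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip words containing N or other symbols
            counts[word] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in kmers]


def intra_bin_distances(contigs, size=1000, k=4):
    """Pairwise Euclidean distances between all fragment profiles in one bin."""
    profiles = [kmer_profile(f, k) for c in contigs for f in fragment(c, size)]
    return [math.dist(a, b) for a, b in combinations(profiles, 2)]


def summarize(distances):
    """Simple descriptive statistics of the distance distribution."""
    n = len(distances)
    mean = sum(distances) / n
    var = sum((d - mean) ** 2 for d in distances) / n
    return {"n": n, "mean": mean, "std": math.sqrt(var), "max": max(distances)}
```

In a sketch like this, a bin whose fragments come from a single genome would be expected to show a tighter distance distribution than a bin contaminated with fragments from an unrelated genome. Statistics of this kind could then serve as input features for a trained quality predictor, which is the general role the abstract describes for the mock community.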
MAGISTA achieved a 20% reduction in root-mean-square error compared with the marker gene approach when tested on publicly available mock metagenomes. Furthermore, our highly complex genomic DNA mock community is a valuable tool for benchmarking (new) metagenome analysis methods.
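For readers unfamiliar with the metric, the following sketch shows how a root-mean-square error and a relative reduction between two methods could be computed. The function names are hypothetical, and the abstract does not specify which quality values (for example completeness or contamination estimates) were compared, so this is only a generic illustration of the arithmetic.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between predicted and true quality values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def relative_reduction(baseline_rmse, candidate_rmse):
    """Fraction by which the candidate's RMSE is lower than the baseline's."""
    return (baseline_rmse - candidate_rmse) / baseline_rmse
```

With these definitions, a 20% reduction corresponds to `relative_reduction(baseline_rmse, candidate_rmse)` being approximately 0.2.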