School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA.
Department of Microbiology, University of Innsbruck, Innsbruck, Tyrol, Austria.
Appl Environ Microbiol. 2021 Feb 26;87(6). doi: 10.1128/AEM.02593-20.
The recovery of metagenome-assembled genomes (MAGs) from metagenomic data has recently become a common task for microbial studies. The strengths and limitations of the underlying bioinformatics algorithms are well appreciated by now based on performance tests with mock data sets of known composition. However, these mock data sets do not capture the complexity and diversity often observed within natural populations, since their construction typically relies on only a single genome of a given organism. Further, it remains unclear if MAGs can recover population-variable genes (those shared by >10% but <90% of the members of the population) as efficiently as core genes (those shared by >90% of the members). To address these issues, we compared the gene variabilities of pathogenic isolates from eight diarrheal samples, for which the isolate was the causative agent, against their corresponding MAGs recovered from the companion metagenomic data set. Our analysis revealed that MAGs with completeness estimates near 95% captured only 77% of the population core genes and 50% of the variable genes, on average. Further, about 5% of the genes of these MAGs were conservatively identified as missing in the isolate and were of different (non-) taxonomic origin, suggesting errors at the genome-binning step, even though contamination estimates based on commonly used pipelines were only 1.5%. Therefore, the quality of MAGs may often be worse than estimated, and we offer examples of how to recognize and improve such MAGs to sufficient quality by (for instance) employing only contigs longer than 1,000 bp for binning.
Metagenome assembly and the recovery of metagenome-assembled genomes (MAGs) have recently become common tasks for microbiome studies across environmental and clinical settings. However, the extent to which MAGs can capture the genes of the population they represent remains speculative. Current approaches to evaluating MAG quality are limited to the recovery and copy number of universal housekeeping genes, which represent a small fraction of the total genome, leaving the majority of the genome essentially inaccessible. If MAG quality in reality is lower than these approaches would estimate, this could have dramatic consequences for all downstream analyses and interpretations. In this study, we evaluated this issue using an approach that employed comparisons of the gene contents of MAGs to the gene contents of isolate genomes derived from the same sample. Further, our samples originated from a diarrhea case-control study, and thus, our results are relevant for recovering the virulence factors of pathogens from metagenomic data sets.
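The core and variable gene definitions used above (core: shared by >90% of population members; variable: shared by >10% but <90%) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `classify_genes` and `recovery`, the presence/absence input format, and the toy gene identifiers are all assumptions made for illustration only.

```python
from typing import Dict, Set, Tuple


def classify_genes(
    presence: Dict[str, Set[str]],
    n_genomes: int,
    core_min: float = 0.90,      # threshold taken from the abstract
    variable_min: float = 0.10,  # threshold taken from the abstract
) -> Tuple[Set[str], Set[str]]:
    """Split genes into core and variable sets by prevalence.

    `presence` maps a gene (or gene-cluster) ID to the set of population
    genomes that carry it. This input format is a hypothetical choice.
    """
    core, variable = set(), set()
    for gene, genomes in presence.items():
        prevalence = len(genomes) / n_genomes
        if prevalence > core_min:
            core.add(gene)
        elif variable_min < prevalence < core_min:
            variable.add(gene)
        # Genes at or below 10% prevalence (or exactly at 90%) fall in
        # neither class under the strict inequalities in the abstract.
    return core, variable


def recovery(mag_genes: Set[str], reference: Set[str]) -> float:
    """Fraction of a reference gene set that is also found in the MAG."""
    if not reference:
        return float("nan")
    return len(mag_genes & reference) / len(reference)


# Toy population of 10 genomes (made-up data, for illustration only).
all_genomes = {f"g{i}" for i in range(10)}
presence = {
    "geneA": set(all_genomes),               # 100% of genomes: core
    "geneD": set(all_genomes),               # 100% of genomes: core
    "geneB": {"g0", "g1", "g2", "g3", "g4"}, # 50% of genomes: variable
    "geneC": {"g0"},                         # 10% of genomes: neither
}
core, variable = classify_genes(presence, n_genomes=10)

# Hypothetical MAG gene content.
mag_genes = {"geneA", "geneB"}
core_recovery = recovery(mag_genes, core)
variable_recovery = recovery(mag_genes, variable)
```

In this toy example the MAG carries one of the two core genes and the single variable gene. In practice, gene presence would first have to be established by clustering predicted genes across genomes, a step this sketch leaves out.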