Center for Bioinformatics and Computational Genomics and School of Biology, Georgia Institute of Technology, Atlanta, GA 30332-0512, USA.
ISME J. 2012 Apr;6(4):898-901. doi: 10.1038/ismej.2011.147. Epub 2011 Oct 27.
Assembling individual genomes from complex community metagenomic data remains a challenge for environmental studies. We evaluated the quality of genome assemblies from community short-read data (Illumina 100 bp paired-end sequences) using datasets recovered from freshwater and soil microbial communities as well as in silico simulations. Our analyses revealed that the genome of a single genotype (or species) can be accurately assembled from a complex metagenome when it is sequenced to at least about 20× coverage. At lower coverage, however, the derived assemblies contained a substantial fraction of non-target sequences (chimeras), which explains, at least in part, the higher number of hypothetical genes recovered in metagenomic projects than in genomic projects. We also provide examples of how to detect intrapopulation structure in metagenomic datasets and how to estimate the type and frequency of errors in assembled genes and contigs from datasets of varied species complexity.
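To make the coverage threshold concrete, the following is a minimal sketch of the standard mean-coverage arithmetic (sequenced bases from the target divided by genome size), applied to one population within a metagenome. It is not the authors' pipeline. The function names and all numeric inputs (total read count, genome size, fraction of reads from the target) are hypothetical values chosen for illustration; only the 100 bp read length and the approximate 20× threshold come from the abstract.

```python
import math


def expected_coverage(total_reads, read_length_bp, genome_size_bp, read_fraction):
    """Mean per-base coverage of one target genome in a metagenome.

    read_fraction is the (assumed) share of all reads that come from the target.
    """
    target_bases = total_reads * read_fraction * read_length_bp
    return target_bases / genome_size_bp


def reads_needed(target_coverage, read_length_bp, genome_size_bp, read_fraction):
    """Total metagenome reads needed for the target to reach target_coverage."""
    return math.ceil(
        target_coverage * genome_size_bp / (read_length_bp * read_fraction)
    )


# Hypothetical inputs, not taken from the study.
TOTAL_READS = 50_000_000   # reads in the whole metagenome
READ_LENGTH = 100          # bp, matching the read length in the abstract
GENOME_SIZE = 5_000_000    # bp, assumed size of the target genome
READ_FRACTION = 0.01       # assumed share of reads from the target

coverage = expected_coverage(TOTAL_READS, READ_LENGTH, GENOME_SIZE, READ_FRACTION)
needed = reads_needed(20, READ_LENGTH, GENOME_SIZE, READ_FRACTION)

print(f"expected coverage: {coverage:.1f}x")
print(f"reads needed for about 20x: {needed:,}")
```

With these made-up inputs the target sits at roughly 10× coverage, below the approximately 20× level reported above, so about twice as much sequencing would be needed for that population. The calculation gives only a mean; real coverage varies along the genome.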