Computer Science Department, George Mason University, Fairfax, Virginia, USA.
BMC Genomics. 2011;12(Suppl 2):S8. doi: 10.1186/1471-2164-12-S2-S8. Epub 2011 Jul 27.
Metagenomic assembly is a challenging problem because a sample contains genetic material from multiple organisms. The problem becomes even more difficult when the input consists of short reads produced by next-generation sequencing technologies. Whole-genome assemblers are not designed for metagenomic samples, but they are used for metagenomics anyway because assemblers built for such samples are lacking. We present an evaluation of the assembly of simulated short-read metagenomic samples using a state-of-the-art de Bruijn graph-based assembler.
We assembled simulated metagenomic reads from datasets of varying complexity using a state-of-the-art de Bruijn graph-based parallel assembler. We also studied the effect of the k-mer size used in the de Bruijn graph on metagenomic assembly and developed a clustering solution that pools the contigs obtained from different assembly runs, which allowed us to obtain longer contigs. In addition, we assessed the degree of chimericity of the assembled contigs using an entropy/impurity metric and compared the metagenomic assemblies with assemblies of the isolated individual source genomes.
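As an illustration of how an entropy/impurity metric can quantify chimericity, the sketch below scores a single contig from the source-genome labels of the reads assigned to it. This is a minimal sketch, not the formulation used in the study: it assumes that each read's source genome is known (as it would be for simulated data), that Shannon entropy and Gini impurity are acceptable stand-ins for the metric, and the names `contig_entropy` and `contig_gini` are hypothetical.

```python
import math
from collections import Counter


def contig_entropy(read_sources):
    """Shannon entropy (in bits) of the source-genome labels in one contig.

    read_sources: iterable of genome labels, one per read in the contig
    (hypothetical input format; assumes labels are known from simulation).
    A value of 0 means every read comes from a single genome.
    """
    counts = Counter(read_sources)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("contig has no reads")
    # p * log2(1/p) summed over genomes; total/c equals 1/p.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def contig_gini(read_sources):
    """Gini impurity of the source-genome labels in one contig."""
    counts = Counter(read_sources)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("contig has no reads")
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


if __name__ == "__main__":
    pure = ["genomeA"] * 4
    mixed = ["genomeA", "genomeA", "genomeB", "genomeB"]
    print(contig_entropy(pure), contig_gini(pure))
    print(contig_entropy(mixed), contig_gini(mixed))
```

Under these assumptions, a contig built entirely from one genome scores zero on both measures, and the scores rise as reads from different genomes are mixed into the same contig.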
Our results show that the accuracy of the assembled contigs was better than expected for metagenomic samples with a few dominant organisms, but especially poor for samples containing many closely related strains. Clustering contigs from assemblies run with different k-mer sizes yielded longer contigs; however, the clustering also accumulated erroneous contigs, which increased the error rate of the clustered contigs.