Assessment of metagenomic assembly using simulated next generation sequencing data.
Affiliation
European Molecular Biology Laboratory, Heidelberg, Germany.
Publication information
PLoS One. 2012;7(2):e31386. doi: 10.1371/journal.pone.0031386. Epub 2012 Feb 23.
Due to the complexity of the protocols and a limited knowledge of the nature of microbial communities, simulating metagenomic sequences plays an important role in testing the performance of existing tools and data analysis methods with metagenomic data. We developed metagenomic read simulators with platform-specific (Sanger, pyrosequencing, Illumina) base-error models, and simulated metagenomes of differing community complexities. We first evaluated the effect of rigorous quality control on Illumina data. Although quality filtering removed a large proportion of the data, it greatly improved the accuracy and contig lengths of resulting assemblies. We then compared the quality-trimmed Illumina assemblies to those from Sanger and pyrosequencing. For the simple community (10 genomes) all sequencing technologies assembled a similar amount and accurately represented the expected functional composition. For the more complex community (100 genomes) Illumina produced the best assemblies and more correctly resembled the expected functional composition. For the most complex community (400 genomes) there was very little assembly of reads from any sequencing technology. However, due to the longer read length the Sanger reads still represented the overall functional composition reasonably well. We further examined the effect of scaffolding of contigs using paired-end Illumina reads. It dramatically increased contig lengths of the simple community and yielded minor improvements to the more complex communities. Although the increase in contig length was accompanied by increased chimericity, it resulted in more complete genes and a better characterization of the functional repertoire. The metagenomic simulators developed for this research are freely available.
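The abstract does not describe how the platform-specific base-error models work, so the following is only a minimal sketch of the general idea of simulating a read with positional substitution errors and then quality-trimming it. It assumes, purely for illustration, an error rate that rises linearly from the 5' to the 3' end of the read. The function names (`simulate_read`, `trim_3prime`) and all parameter values are hypothetical and are not taken from the published simulators.

```python
import math
import random

BASES = "ACGT"


def simulate_read(genome, read_len, error_start, error_end, rng):
    """Sample one read from `genome` and add substitution errors.

    Assumption (not from the paper): the per-base error probability rises
    linearly from `error_start` at the first base to `error_end` at the last.
    Both probabilities must be greater than 0 so a Phred score can be computed.
    """
    start = rng.randrange(len(genome) - read_len + 1)
    template = genome[start:start + read_len]
    read, quals = [], []
    for i, base in enumerate(template):
        frac = i / max(read_len - 1, 1)
        p_err = error_start + (error_end - error_start) * frac
        if rng.random() < p_err:
            # Substitute with one of the three other bases.
            base = rng.choice([b for b in BASES if b != base])
        read.append(base)
        # Phred quality score: Q = -10 * log10(p_err).
        quals.append(round(-10 * math.log10(p_err)))
    return "".join(read), quals


def trim_3prime(read, quals, min_q):
    """Cut the read at the first base whose quality is below `min_q`."""
    for i, q in enumerate(quals):
        if q < min_q:
            return read[:i], quals[:i]
    return read, quals


if __name__ == "__main__":
    rng = random.Random(0)
    genome = "".join(rng.choice(BASES) for _ in range(1000))
    read, quals = simulate_read(genome, 75, 0.001, 0.05, rng)
    trimmed, trimmed_quals = trim_3prime(read, quals, min_q=20)
    print(len(read), len(trimmed))
```

Trimming of this kind discards the low-quality 3' portion of each read, which illustrates the trade-off the abstract reports: less data remains after filtering, but what remains contains fewer errors.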