Program in Ecology, University of Wyoming, Laramie, WY, USA.
Wildlife Genomics and Disease Ecology Laboratory, Department of Veterinary Sciences, University of Wyoming, Laramie, WY, USA.
Mol Ecol Resour. 2020 Mar;20(2):360-370. doi: 10.1111/1755-0998.13108. Epub 2019 Nov 25.
Advances in DNA sequencing have made it feasible to gather genomic data for non-model organisms and large sets of individuals, often using methods that sequence subsets of the genome. Several of these methods sequence DNA associated with endonuclease restriction sites (various RAD and GBS methods). For use in taxa without a reference genome, these methods rely on de novo assembly of fragments in the sequencing library. Many of the software options available for this application were originally developed for other assembly types, and their accuracy for reduced representation libraries is unknown. To address this knowledge gap, we simulated data from the Arabidopsis thaliana and Homo sapiens genomes and compared de novo assemblies produced by six software programs that are commonly used or promising for this purpose (ABySS, CD-HIT, Stacks, Stacks2, Velvet and VSEARCH). We simulated different mutation rates and types of mutations, and then applied the six assemblers to the simulated data sets, varying assembly parameters. We found substantial variation in software performance across simulations and parameter settings. ABySS failed to recover any true genome fragments, and Velvet and VSEARCH performed poorly for most simulations. Stacks and Stacks2 produced accurate assemblies of simulations containing SNPs, but the addition of insertion and deletion mutations decreased their performance. CD-HIT was the only assembler that consistently recovered a high proportion of true genome fragments. Here, we demonstrate substantial differences in the accuracy of assemblies from different software programs and the importance of comparing assemblies that result from different parameter settings.
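The simulation design summarized in the abstract can be pictured with a small sketch. This is not the authors' pipeline. It is a minimal illustration under stated assumptions: fragments are taken as a fixed number of bases starting at each occurrence of a restriction site (the EcoRI site GAATTC is used here only as an example), SNPs and indels are applied per base at independent rates, and accuracy is scored as the fraction of true fragments recovered by exact match. The function names `digest`, `mutate` and `recovery`, and all parameter values, are hypothetical.

```python
import random


def digest(genome, site="GAATTC", read_len=20):
    """Collect read_len bases starting at each occurrence of the cut site.

    Assumption: a fixed-length read anchored at the site stands in for a
    restriction-associated fragment. Truncated fragments are dropped.
    """
    fragments = []
    start = genome.find(site)
    while start != -1:
        frag = genome[start:start + read_len]
        if len(frag) == read_len:
            fragments.append(frag)
        start = genome.find(site, start + 1)
    return fragments


def mutate(fragment, snp_rate, indel_rate, rng):
    """Apply SNPs and single-base indels independently at each position."""
    out = []
    for base in fragment:
        r = rng.random()
        if r < snp_rate:
            # SNP: substitute a different nucleotide
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        elif r < snp_rate + indel_rate:
            if rng.random() < 0.5:
                continue  # deletion: skip this base
            out.append(base)
            out.append(rng.choice("ACGT"))  # insertion after this base
        else:
            out.append(base)
    return "".join(out)


def recovery(true_fragments, contigs):
    """Fraction of distinct true fragments that appear exactly among contigs.

    Assumption: exact string match; the abstract does not specify the
    matching criterion used in the study.
    """
    truth = set(true_fragments)
    if not truth:
        return 0.0
    return len(truth & set(contigs)) / len(truth)


if __name__ == "__main__":
    rng = random.Random(1)
    # Toy random "genome" for illustration only
    genome = "".join(rng.choice("ACGT") for _ in range(200_000))
    truth = digest(genome)
    # Simulated individuals: mutated copies of each true fragment
    reads = [mutate(f, 0.01, 0.002, rng) for f in truth for _ in range(5)]
    # A naive "assembler" that keeps every distinct read as a contig
    print(len(truth), recovery(truth, reads))
```

In a real comparison, the `reads` list would be written to FASTQ or FASTA and passed to each assembler, and the resulting contigs would replace the naive stand-in used here. Exact matching is a deliberately strict criterion; an alignment-based criterion would be more tolerant of indels.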