BGI HK Research Institute, 16 Dai Fu Street, Tai Po Industrial Estate, Hong Kong.
Gigascience. 2012 Dec 27;1(1):18. doi: 10.1186/2047-217X-1-18.
The number of de novo genome assemblies built from next-generation sequencing (NGS) short reads is increasing rapidly; however, several major challenges must still be overcome for such assembly to be efficient and accurate. SOAPdenovo has been successfully applied to assemble many published genomes, but it still needs improvement in continuity, accuracy and coverage, especially in repeat regions.
To overcome these challenges, we have developed its successor, SOAPdenovo2, whose new algorithm design reduces memory consumption in graph construction, resolves more repeat regions in contig assembly, increases coverage and length in scaffold construction, improves gap closing, and is optimized for large genomes.
Benchmarks using the Assemblathon1 and GAGE datasets showed that SOAPdenovo2 greatly surpasses its predecessor SOAPdenovo and is competitive with other assemblers in both assembly length and accuracy. We also provide an updated assembly of the 2008 Asian (YH) genome using SOAPdenovo2. The contig and scaffold N50 of the YH genome were ~20.9 kbp and ~22 Mbp, respectively, which are 3-fold and 50-fold longer than in the first published version. Genome coverage increased from 81.16% to 93.91%, and peak memory consumption was ~2/3 lower.
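The contig and scaffold N50 values reported here follow the usual definition of the statistic: the length L such that sequences of length at least L together account for at least half of the total assembled length. A minimal Python sketch under that assumption is given below. The function name `n50` and the toy lengths are illustrative only and are not part of SOAPdenovo2 or its output.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Assumes the common definition: the largest length L such that
    sequences of length >= L cover at least half of the total length.
    """
    lengths = list(lengths)
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downwards, accumulating length
    # until half of the total assembly length has been covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example (hypothetical lengths, not data from the paper).
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_n50 = n50(toy_contigs)
```

Because N50 depends only on the sorted length distribution, it measures continuity and not correctness, which is why accuracy and genome coverage are reported alongside it.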