Institute of Information Science, Academia Sinica, Taipei 115, Taiwan.
Proc Natl Acad Sci U S A. 2013 Sep 3;110(36):E3417-24. doi: 10.1073/pnas.1314090110. Epub 2013 Aug 21.
Assembling a large genome from next-generation sequencing reads requires a large amount of computer memory and a long execution time. To reduce these requirements, we propose an extension-based assembler called JR-Assembler, where J and R stand for "jumping" extension and read "remapping." First, it uses read counts to select good-quality reads as seeds. Second, it extends each seed by a whole-read extension process, which speeds up extension and can jump over short repeats. Third, it uses a dynamic back-trimming process to keep sequencing errors from terminating the extension. Fourth, it remaps reads to each assembled sequence and, if the presence of a repeat has caused an assembly error, breaks the contig at the repeat boundaries. Fifth, it applies a less stringent extension criterion to connect low-coverage regions. Finally, it merges contigs using unused reads. An extensive comparison of JR-Assembler with current assemblers on datasets from small, medium, and large genomes shows that JR-Assembler achieves better or comparable overall assembly quality while requiring less memory and less central processing unit (CPU) time, especially for large genomes. A simulation study further shows that JR-Assembler outperforms most current assemblers in memory use and CPU time when the read length is 150 bp or longer, indicating that its advantages over current assemblers will grow as read lengths increase with advances in next-generation sequencing technology.
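The first two steps of the abstract (seed selection by read count, then whole-read extension) can be illustrated with a toy sketch. This is not the JR-Assembler implementation: the function names `select_seeds` and `extend_right`, the parameters `min_count` and `min_overlap`, and the greedy longest-overlap rule are all assumptions made for illustration, and the sketch omits back trimming, read remapping, repeat handling, and contig merging.

```python
from collections import Counter


def select_seeds(reads, min_count=2):
    """Return distinct reads seen at least `min_count` times, most frequent first.

    Assumption: a read that occurs several times is less likely to contain
    a sequencing error, so it is a reasonable seed candidate.
    """
    counts = Counter(reads)
    return [read for read, n in counts.most_common() if n >= min_count]


def extend_right(seed, reads, min_overlap=4):
    """Greedily extend `seed` to the right, one whole read at a time.

    At each step, pick the unused read whose prefix has the longest exact
    match (at least `min_overlap` bases) with the current contig suffix,
    and append the non-overlapping tail of that read in one go.
    """
    contig = seed
    unused = set(reads) - {seed}
    while True:
        best_read, best_overlap = None, 0
        for read in sorted(unused):  # sorted for deterministic behavior
            longest = min(len(read) - 1, len(contig))
            for k in range(longest, min_overlap - 1, -1):
                if contig.endswith(read[:k]):
                    if k > best_overlap:
                        best_read, best_overlap = read, k
                    break  # longest overlap for this read found
        if best_read is None:
            return contig  # no read extends the contig any further
        contig += best_read[best_overlap:]
        unused.discard(best_read)


if __name__ == "__main__":
    # Hypothetical toy data: 8 bp reads sampled every 2 bp from a short sequence.
    genome = "ACGTTGCATGGACCTAGT"
    reads = [genome[i:i + 8] for i in range(0, len(genome) - 7, 2)]
    reads.append(genome[:8])  # duplicate the first read so it qualifies as a seed
    seed = select_seeds(reads)[0]
    print(extend_right(seed, reads))
```

Appending the whole tail of a read at each step, rather than one base at a time, is what makes this style of extension fast. A real assembler would also need to handle reverse complements, mismatches, and ambiguous extensions at repeats.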