School of Computer Engineering, Nanyang Technological University, Singapore.
BMC Bioinformatics. 2011 Aug 25;12:354. doi: 10.1186/1471-2105-12-354.
Next-generation sequencing technologies have driven an explosive increase in DNA sequencing throughput and have spurred the recent development of de novo short read assemblers. However, existing assemblers require long execution times and large amounts of compute resources to assemble large genomes from large quantities of short reads.
We present PASHA, a parallelized short read assembler using de Bruijn graphs, which takes advantage of hybrid computing architectures consisting of both shared-memory multi-core CPUs and distributed-memory compute clusters to gain efficiency and scalability. Evaluation using three small-scale real paired-end datasets shows that PASHA produces more contiguous, high-quality assemblies in less time than three leading assemblers: Velvet, ABySS and SOAPdenovo. PASHA's scalability to large genome datasets is demonstrated by assembling the human genome. Compared to ABySS, PASHA achieves competitive assembly quality with faster execution on the same compute resources, yielding an NG50 contig size of 503 bp with a longest correct contig of 18,252 bp, and an NG50 scaffold size of 2,294 bp. Moreover, the human assembly is completed in about 21 hours with only modest compute resources.
The development of parallel assemblers for large genomes has attracted significant research effort, driven by the explosive growth in the size of high-throughput short read datasets. By employing hybrid parallelism that combines multi-threading on multi-core CPUs with message passing on compute clusters, PASHA is able to assemble the human genome with high quality and in reasonable time using modest compute resources.
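For readers unfamiliar with the NG50 figures quoted in the results, the following is a minimal sketch of how the metric is commonly computed. It is not taken from PASHA; the function name `ng50` and the toy contig lengths are assumptions made for illustration. NG50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the reference genome size (N50 uses the total assembly size instead).

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50 of an assembly (hypothetical helper, for illustration).

    Contigs are taken from longest to shortest until their cumulative
    length reaches half of the reference genome size; the length of the
    last contig taken is the NG50. Returns 0 when the contigs cover less
    than half of the genome.
    """
    half = genome_size / 2
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= half:
            return length
    return 0


# Toy example: five contigs against an assumed 400 bp reference.
lengths = [100, 80, 60, 40, 20]
print(ng50(lengths, genome_size=400))
```

Because the threshold is tied to the genome size rather than to the assembly size, NG50 values from different assemblers run on the same genome can be compared directly.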