Communication & Computer Network Lab of Guangdong, School of Computer Science & Engineering, South China University of Technology, Wushan Road 381, Guangzhou 51000, China.
Genes (Basel). 2019 Nov 4;10(11):886. doi: 10.3390/genes10110886.
(1) Background: DNA sequence alignment is an essential step in genome analysis. BWA-MEM is a prevalent single-node alignment tool because of its high speed and accuracy. However, genome data are being generated at an exponentially growing rate, and handling such volumes requires a multi-node solution, which remains a challenge. Spark is a widely used big data platform that has been exploited to scale genome alignment across nodes. Nonetheless, existing works that use Spark to optimize BWA-MEM suffer from high overhead. (2) Methods: In this paper, we present PipeMEM, a framework that uses the pipe operation in Spark to accelerate BWA-MEM with lower overhead. We additionally propose a pipeline structure and in-memory computation to further accelerate PipeMEM. (3) Results: Our experiments showed that, on paired-end alignment tasks, our framework had low overhead. In a multi-node environment, our framework was, on average, 2.27× faster than BWASpark (an alignment tool in the Genome Analysis Toolkit (GATK)) and 2.33× faster than SparkBWA. (4) Conclusions: PipeMEM can accelerate BWA-MEM in the Spark environment with high performance and low overhead.
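As a rough illustration of the pipe idea named in the Methods, the sketch below emulates it in plain Python without Spark: each partition of records is written to the standard input of an external process, and that process's standard output becomes the partition's result. Spark's RDD `pipe` operation works along these lines, but this is not the authors' implementation. The names `pipe_partition`, `pipe_partitions`, and `STAND_IN_COMMAND` are hypothetical, and a trivial child process that upper-cases its input stands in for an external aligner such as BWA-MEM.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an external aligner: a child process that reads
# records on stdin and writes one transformed line per record to stdout.
STAND_IN_COMMAND = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin:\n    sys.stdout.write(line.upper())",
]


def pipe_partition(records, command):
    """Send one partition's records to `command` via stdin; return its stdout lines."""
    payload = "".join(record + "\n" for record in records)
    completed = subprocess.run(
        command, input=payload, capture_output=True, text=True, check=True
    )
    return completed.stdout.splitlines()


def pipe_partitions(partitions, command, workers=2):
    """Pipe every partition through its own external process, preserving order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda part: pipe_partition(part, command), partitions))


if __name__ == "__main__":
    # Toy records only; real input would be sequencing reads.
    partitions = [["acgt", "ttga"], ["ggcc"]]
    for index, lines in enumerate(pipe_partitions(partitions, STAND_IN_COMMAND)):
        print(index, lines)
```

Because the external tool is treated as a black box that only sees stdin and stdout, this style of integration needs no changes to the tool itself, which is one plausible reason a pipe-based design could keep overhead low.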