PipeMEM: A Framework to Speed Up BWA-MEM in Spark with Low Overhead.

Affiliation

Communication & Computer Network Lab of Guangdong, School of Computer Science & Engineering, South China University of Technology, Wushan Road 381, Guangzhou 51000, China.

Publication information

Genes (Basel). 2019 Nov 4;10(11):886. doi: 10.3390/genes10110886.

DOI:10.3390/genes10110886

PMID:31689965

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6896194/

Abstract

(1) Background: DNA sequence alignment is an essential step in genome analysis. BWA-MEM has been a prevalent single-node tool for genome alignment because of its high speed and accuracy. Genome data are being generated at an exponentially growing rate, and handling such volumes requires a multi-node solution, which remains a challenge. Spark is a widely used big data platform that has been exploited to assist genome alignment in meeting this challenge. Nonetheless, existing works that use Spark to optimize BWA-MEM suffer from high overhead. (2) Methods: In this paper, we present PipeMEM, a framework that accelerates BWA-MEM with lower overhead by using the pipe operation in Spark. We additionally propose a pipeline structure and in-memory computation to accelerate PipeMEM. (3) Results: Our experiments showed that, on paired-end alignment tasks, our framework had low overhead. In a multi-node environment, our framework was, on average, 2.27× faster than BWASpark (an alignment tool in the Genome Analysis Toolkit (GATK)) and 2.33× faster than SparkBWA. (4) Conclusions: PipeMEM can accelerate BWA-MEM in the Spark environment with high performance and low overhead.
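The pipe operation mentioned in the Methods can be illustrated with a small sketch. In Spark, `RDD.pipe(command)` streams each partition's records to an external process on standard input, one record per line, and collects the lines the process writes to standard output. The code below imitates that behaviour with Python's standard library only. It is not PipeMEM's implementation: the `FAKE_ALIGNER` command is a hypothetical stand-in for an external aligner such as `bwa mem`, the helper names are invented for illustration, and partitions are processed one after another here, whereas Spark would run them in parallel across executors.

```python
import subprocess
import sys

# Hypothetical stand-in for an external aligner (e.g. `bwa mem`):
# a child process that reads lines on stdin and echoes each with a prefix.
FAKE_ALIGNER = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin:\n    sys.stdout.write('aligned:' + line)",
]


def split_into_partitions(records, num_partitions):
    """Split records into contiguous chunks, preserving their order."""
    size = max(1, -(-len(records) // num_partitions))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def pipe_partition(partition, command):
    """Send one partition to an external process, one record per line,
    and return the lines the process writes to stdout."""
    payload = "".join(record + "\n" for record in partition)
    result = subprocess.run(
        command, input=payload, capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()


def pipe_all(records, num_partitions, command):
    """Pipe every partition through the command and concatenate the outputs.
    Sequential here; a cluster framework would run partitions concurrently."""
    output = []
    for partition in split_into_partitions(records, num_partitions):
        output.extend(pipe_partition(partition, command))
    return output


# Made-up example records standing in for sequencing reads.
reads = ["read1 ACGT", "read2 TTGA", "read3 GGCC", "read4 CATG", "read5 AACC"]
aligned = pipe_all(reads, num_partitions=2, command=FAKE_ALIGNER)
```

Because the external tool is used unchanged and only sees text on its standard streams, this style of integration needs no modification of the tool itself, which is one plausible reason a pipe-based design could keep overhead low.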

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ca49/6896194/dc79ea031495/genes-10-00886-g001.jpg
