Halvade: scalable sequence analysis with MapReduce.

Authors

Decap Dries, Reumers Joke, Herzeel Charlotte, Costanza Pascal, Fostier Jan

Affiliations

Department of Information Technology, Ghent University - iMinds, Gaston Crommenlaan 8 bus 201, 9050 Ghent, Belgium, ExaScience Life Lab, Kapeldreef 75, 3001 Leuven, Belgium.

ExaScience Life Lab, Kapeldreef 75, 3001 Leuven, Belgium, Janssen Research & Development, a division of Janssen Pharmaceutica N.V., 2340 Beerse, Belgium.

Publication information

Bioinformatics. 2015 Aug 1;31(15):2482-8. doi: 10.1093/bioinformatics/btv179. Epub 2015 Mar 26.

DOI:10.1093/bioinformatics/btv179

PMID:25819078

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4514927/

Abstract

MOTIVATION

Post-sequencing DNA analysis typically consists of read mapping followed by variant calling. Especially for whole genome sequencing, this computational step is very time-consuming, even when using multithreading on a multi-core machine.

RESULTS

We present Halvade, a framework that enables sequencing pipelines to be executed in parallel on a multi-node and/or multi-core compute infrastructure in a highly efficient manner. As an example, a DNA sequencing analysis pipeline for variant calling has been implemented according to the GATK Best Practices recommendations, supporting both whole genome and whole exome sequencing. Using a 15-node computer cluster with 360 CPU cores in total, Halvade processes the NA12878 dataset (human, 100 bp paired-end reads, 50× coverage) in <3 h with very high parallel efficiency. Even on a single, multi-core machine, Halvade attains a significant speedup compared with running the individual tools with multithreading.
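The abstract names MapReduce as the parallelization model but does not describe the tasks themselves. The following is a minimal sketch of that general pattern, not Halvade's actual implementation: it assumes map tasks process chunks of reads and key them by genomic region, and reduce tasks then handle each region independently. The function names, the `REGION_SIZE` constant, the precomputed read positions, and the read-count "variant calling" stand-in are all hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

REGION_SIZE = 1000  # hypothetical region width in base pairs


def map_chunk(chunk):
    """Map task: key each read in a chunk by the genomic region it falls in.

    Each read is a (name, chrom, pos) tuple with a precomputed position;
    a real pipeline would run a read aligner at this step instead.
    """
    return [((chrom, pos // REGION_SIZE), (name, pos)) for name, chrom, pos in chunk]


def shuffle(mapped):
    """Group mapped records by region key, sorted by position within a region."""
    groups = defaultdict(list)
    for records in mapped:
        for key, value in records:
            groups[key].append(value)
    return {key: sorted(vals, key=lambda v: v[1]) for key, vals in groups.items()}


def reduce_region(item):
    """Reduce task: process one region independently of all others.

    Counting reads is a stand-in for per-region variant calling.
    """
    key, reads = item
    return key, len(reads)


def run_pipeline(chunks, workers=4):
    """Run the map, shuffle, and reduce phases using a local thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_chunk, chunks))
        grouped = shuffle(mapped)
        return dict(pool.map(reduce_region, sorted(grouped.items())))


if __name__ == "__main__":
    example_chunks = [
        [("r1", "chr1", 150), ("r2", "chr1", 1200)],
        [("r3", "chr1", 900), ("r4", "chr2", 10)],
    ]
    print(run_pipeline(example_chunks))
```

Because every chunk and every region is processed independently, both phases can be spread over many cores or nodes, which is the property the abstract relies on for its multi-node results.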
