Decap Dries, Reumers Joke, Herzeel Charlotte, Costanza Pascal, Fostier Jan
Department of Information Technology, Ghent University - iMinds, Gaston Crommenlaan 8 bus 201, 9050 Ghent, Belgium, ExaScience Life Lab, Kapeldreef 75, 3001 Leuven, Belgium.
ExaScience Life Lab, Kapeldreef 75, 3001 Leuven, Belgium, Janssen Research & Development, a division of Janssen Pharmaceutica N.V., 2340 Beerse, Belgium.
Bioinformatics. 2015 Aug 1;31(15):2482-8. doi: 10.1093/bioinformatics/btv179. Epub 2015 Mar 26.
Post-sequencing DNA analysis typically consists of read mapping followed by variant calling. Especially for whole genome sequencing, this computational step is very time-consuming, even when using multithreading on a multi-core machine.
We present Halvade, a framework that enables sequencing pipelines to be executed in parallel on a multi-node and/or multi-core compute infrastructure in a highly efficient manner. As an example, a DNA sequencing analysis pipeline for variant calling has been implemented according to the GATK Best Practices recommendations, supporting both whole genome and whole exome sequencing. Using a 15-node computer cluster with 360 CPU cores in total, Halvade processes the NA12878 dataset (human, 100 bp paired-end reads, 50× coverage) in <3 h with very high parallel efficiency. Even on a single, multi-core machine, Halvade attains a significant speedup compared with running the individual tools with multithreading.
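The abstract reports "very high parallel efficiency" without stating how the figure is computed. As a rough illustration only, the sketch below uses the conventional definitions (speedup = serial time / parallel time, efficiency = speedup / number of cores). The cluster shape (15 nodes, 360 cores) and the 3 h upper bound come from the abstract; the single-core runtime is a hypothetical placeholder, not a measurement from the paper, and the function names are chosen here for illustration.

```python
# Conventional speedup and parallel-efficiency definitions.
# These are textbook formulas, not necessarily the exact metric used by the authors.

def speedup(t_serial_h: float, t_parallel_h: float) -> float:
    """Ratio of serial runtime to parallel runtime."""
    return t_serial_h / t_parallel_h


def parallel_efficiency(t_serial_h: float, t_parallel_h: float, n_cores: int) -> float:
    """Speedup divided by the number of cores (1.0 means perfect scaling)."""
    return speedup(t_serial_h, t_parallel_h) / n_cores


# Cluster shape taken from the abstract.
NODES = 15
TOTAL_CORES = 360
cores_per_node = TOTAL_CORES // NODES  # cores available on each node

# HYPOTHETICAL single-core runtime, used only to exercise the formula.
HYPOTHETICAL_SINGLE_CORE_HOURS = 900.0
# Upper bound on the cluster runtime reported in the abstract (<3 h).
CLUSTER_HOURS_UPPER_BOUND = 3.0

if __name__ == "__main__":
    eff = parallel_efficiency(
        HYPOTHETICAL_SINGLE_CORE_HOURS, CLUSTER_HOURS_UPPER_BOUND, TOTAL_CORES
    )
    print(f"cores per node: {cores_per_node}")
    print(f"efficiency under the hypothetical baseline: {eff:.2f}")
```

Because the reported runtime is an upper bound, any efficiency derived from it in this way is a lower bound for the chosen baseline.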