Sztuka Marek, Kotlarz Krzysztof, Mielczarek Magda, Hajduk Piotr, Liu Jakub, Szyda Joanna
Wroclaw University of Environmental and Life Sciences, Department of Genetics, the Biostatistics Group, Kozuchowska 7, Wroclaw PL-51631, Poland.
University Cancer Diagnostic Center, Poznan University of Medical Science, Fredry 10, Poznan 61-701, Poland.
NAR Genom Bioinform. 2024 Apr 29;6(2):lqae040. doi: 10.1093/nargab/lqae040. eCollection 2024 Jun.
This study compared computational approaches to parallelizing an SNP calling workflow. The data comprised DNA from five Holstein-Friesian cows sequenced on the Illumina platform. The pipeline consisted of quality control, alignment to the reference genome, post-alignment processing, and SNP calling. Three approaches to parallelization were compared: (i) a plain Bash script in which the pipeline for each cow was executed as a separate process, with all processes invoked at the same time, (ii) a Bash script wrapped in a single Nextflow process and (iii) a Nextflow script with each component of the pipeline defined as a separate process. On average, the multi-process Nextflow script ran 15-27% faster, depending on the number of assigned threads, with the largest execution time advantage over the plain Bash approach observed with 10 threads. RAM usage varied most for the multi-process Nextflow script, for which it increased with the number of assigned threads, whereas RAM consumption of the other two setups depended little on the number of threads assigned for computations. Because of the intermediate and log files it generated, the multi-process Nextflow script used markedly more disk space than the plain Bash script and the single-process Nextflow script.
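To make approach (i) concrete, the following is a minimal sketch of per-sample parallelism: one operating-system process per animal, all started at roughly the same time. It is written in Python rather than Bash and is not the authors' script. The sample names, the helper functions `pipeline_command`, `run_sample` and `run_all`, and the placeholder command are all assumptions made for illustration; a real run would replace the placeholder with the actual quality control, alignment, post-alignment and SNP calling steps.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample identifiers; the study used five cows, but these names are invented.
SAMPLES = ["cow1", "cow2", "cow3", "cow4", "cow5"]


def pipeline_command(sample: str) -> list[str]:
    """Stand-in for one sample's full pipeline.

    A real command would chain quality control, alignment, post-alignment
    processing and SNP calling. Here a child Python process only echoes the
    sample name, so the sketch runs without any bioinformatics tools.
    """
    return [sys.executable, "-c", f"print('finished {sample}')"]


def run_sample(sample: str) -> tuple[str, int, str]:
    """Run one sample's pipeline as a separate OS process and wait for it."""
    result = subprocess.run(pipeline_command(sample), capture_output=True, text=True)
    return sample, result.returncode, result.stdout.strip()


def run_all(samples: list[str]) -> list[tuple[str, int, str]]:
    """Launch one worker per sample so all pipelines run concurrently.

    pool.map returns results in input order, regardless of which
    process finishes first.
    """
    with ThreadPoolExecutor(max_workers=len(samples)) as pool:
        return list(pool.map(run_sample, samples))


if __name__ == "__main__":
    for sample, code, output in run_all(SAMPLES):
        print(sample, code, output)
```

In this layout the unit of parallelism is the whole per-sample pipeline, which corresponds to approaches (i) and (ii). Approach (iii) instead defines each pipeline component as its own process, which is the setup the abstract associates with shorter run times but higher RAM and disk usage.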