Centre for Bioinformatics & Systems Biology, Faculty of Science, University of Khartoum, Khartoum, Sudan.
Department of Electrical and Electronic Engineering, Faculty of Engineering, University of Khartoum, Khartoum, Sudan.
PLoS One. 2019 Jul 9;14(7):e0211608. doi: 10.1371/journal.pone.0211608. eCollection 2019.
Bioinformatics research is frequently performed using complex workflows with multiple steps, fan-outs, merges, and conditionals. This complexity makes the workflow difficult to manage on a computer cluster, especially when running in parallel on large batches of data: hundreds or thousands of samples at a time. Scientific workflow management systems could help with that. Many are now being proposed, but is there yet a "best" workflow management system for bioinformatics? Such a system would need to satisfy numerous, sometimes conflicting requirements: from ease of use, to seamless deployment at peta- and exa-scale, and portability to the cloud. We evaluated Swift/T as a candidate for such a role by implementing a primary genomic variant calling workflow in the Swift/T language, focusing on the workflow management, performance, and scalability issues that arise from production-grade big data genomic analyses. In the process we introduced novel features into the language, which are now part of its open repository. Additionally, we formalized a set of design criteria for quality, robust, maintainable workflows that must function at scale in a production setting, such as a large genomic sequencing facility or a major hospital system. The use of Swift/T conveys two key advantages. (1) It operates transparently in multiple cluster scheduling environments (PBS Torque, SLURM, Cray aprun environment, etc.), so a single workflow is trivially portable across numerous clusters. (2) The leaf functions of Swift/T permit developers to easily swap executables in and out of the workflow, which makes the workflow easy to maintain and makes it easy to request the resources optimal for each stage of the pipeline. While Swift/T's data-level parallelism eliminates the need to code parallel analysis of multiple samples, it does make debugging more difficult, as is common for implicitly parallel code. Nonetheless, the language gives users a powerful and portable way to scale up analyses in many computing architectures.
The code for our implementation of a variant calling workflow using Swift/T can be found on GitHub at https://github.com/ncsa/Swift-T-Variant-Calling, with full documentation provided at http://swift-t-variant-calling.readthedocs.io/en/latest/.
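The two ideas highlighted in the abstract, leaf functions that wrap swappable executables and data-level parallelism over samples, can be illustrated with a minimal Python sketch. This is an analogy only, not Swift/T code and not the authors' implementation: the `STAGES` table, the `leaf`, `process_sample`, and `run_batch` names, and the stand-in commands (small Python one-liners in place of real aligners or variant callers) are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess
import sys

# Hypothetical stage table: each stage name maps to the command implementing it.
# Swapping a tool means editing one entry here, not the workflow logic below.
# The commands are stand-ins (Python one-liners), not real bioinformatics tools.
STAGES = {
    "align": [sys.executable, "-c", "import sys; print('aligned ' + sys.argv[1])"],
    "call_variants": [sys.executable, "-c", "import sys; print('called ' + sys.argv[1])"],
}


def leaf(stage, arg):
    """Run the executable registered for `stage` on one input and return its stdout."""
    result = subprocess.run(
        STAGES[stage] + [arg], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def process_sample(sample):
    """Per-sample pipeline: the second stage consumes the first stage's output."""
    aligned = leaf("align", sample)
    return leaf("call_variants", aligned)


def run_batch(samples, workers=4):
    """Apply the per-sample pipeline to every sample concurrently.

    The per-sample code contains no parallel logic; the pool supplies it,
    loosely mirroring how an implicitly parallel language handles a batch.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_sample, samples))


if __name__ == "__main__":
    print(run_batch(["sampleA", "sampleB"]))
```

In this sketch, replacing a tool amounts to changing one entry in `STAGES`, and the pipeline for a single sample is written once while the executor fans it out over the batch. A real deployment would additionally need per-stage resource requests and scheduler integration, which this sketch does not attempt to model.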