Kanwal Sehrish, Khan Farah Zaib, Lonie Andrew, Sinnott Richard O
Department of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, 3010, Australia.
Melbourne Bioinformatics, The University of Melbourne, Melbourne, VIC, 3010, Australia.
BMC Bioinformatics. 2017 Jul 12;18(1):337. doi: 10.1186/s12859-017-1747-0.
Computational bioinformatics workflows are extensively used to analyse genomics data, and different approaches are available to support their implementation and execution. Reproducibility is one of the core principles of any scientific workflow, yet it remains a challenge that has not been fully addressed. This is due to an incomplete understanding of reproducibility requirements and of the assumptions made by workflow definition approaches. Provenance information should be tracked and used to capture all of these requirements, supporting the reusability of existing workflows.
We implemented a complex but widely deployed bioinformatics workflow using three representative approaches to workflow definition and execution. Through this implementation, we identified assumptions implicit in these approaches that ultimately lead to insufficient documentation of workflow requirements and, in turn, to failed execution of the workflow. This study proposes a set of recommendations that aims to mitigate these assumptions and to guide the scientific community towards reproducible science, thereby addressing the reproducibility crisis.
Reproducing, adapting or even repeating a bioinformatics workflow in any environment requires substantial technical knowledge of the workflow execution environment, resolution of analysis assumptions and rigorous compliance with reproducibility requirements. Towards these goals, we propose conclusive recommendations that, together with an explicit declaration of the workflow specification, would enhance the reproducibility of computational genomic analyses.
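As an illustration of what tracking provenance and explicitly declaring a workflow specification could involve, the following is a minimal sketch only. It is not the method used in the study; the function `provenance_record`, the field names, and the tool name `example-aligner` are hypothetical and chosen purely for illustration. It records, for a single workflow step, the tool version, the parameters, checksums of the inputs and basic details of the execution environment.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 hex digest of some input content."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(step, tool, tool_version, parameters, inputs):
    """Build a provenance record for one workflow step (hypothetical schema).

    `inputs` maps an input name to its raw bytes; only checksums are stored,
    so a later run can verify that it starts from identical data.
    """
    return {
        "step": step,
        "tool": {"name": tool, "version": tool_version},
        "parameters": dict(sorted(parameters.items())),
        "inputs": {name: sha256_of(content)
                   for name, content in sorted(inputs.items())},
        "environment": {
            "python": platform.python_version(),
            "platform": platform.platform(),
        },
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# Example with made-up values: a tiny FASTQ fragment as the only input.
reads = b"@r1\nACGT\n+\nIIII\n"
record = provenance_record(
    step="align",
    tool="example-aligner",      # hypothetical tool name
    tool_version="0.0.0",        # hypothetical version
    parameters={"threads": 4},
    inputs={"reads.fastq": reads},
)
print(json.dumps(record, indent=2))
```

Storing a record of this kind alongside each output makes implicit assumptions, such as tool versions and default parameters, explicit, which is the type of documentation the abstract identifies as missing.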