Investigating reproducibility and tracking provenance - A genomic workflow case study.

Authors

Kanwal Sehrish, Khan Farah Zaib, Lonie Andrew, Sinnott Richard O

Affiliations

Department of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, 3010, Australia.

Melbourne Bioinformatics, The University of Melbourne, Melbourne, VIC, 3010, Australia.

Publication information

BMC Bioinformatics. 2017 Jul 12;18(1):337. doi: 10.1186/s12859-017-1747-0.

Abstract

BACKGROUND

Computational bioinformatics workflows are used extensively to analyse genomics data, and several approaches are available to support their implementation and execution. Reproducibility is a core principle of any scientific workflow, yet it remains a challenge that has not been fully addressed. This is due to an incomplete understanding of reproducibility requirements and of the assumptions made by workflow definition approaches. Provenance information should be tracked and used to capture all these requirements, supporting the reusability of existing workflows.

RESULTS

We implemented a complex but widely deployed bioinformatics workflow using three representative approaches to workflow definition and execution. Through this implementation, we identified assumptions implicit in these approaches that ultimately lead to insufficient documentation of workflow requirements and, in turn, to failed execution of the workflow. This study proposes a set of recommendations that aim to mitigate these assumptions and guide the scientific community towards reproducible science, thereby addressing the reproducibility crisis.

CONCLUSIONS

Reproducing, adapting or even repeating a bioinformatics workflow in any environment requires substantial technical knowledge of the workflow execution environment, resolution of analysis assumptions and rigorous compliance with reproducibility requirements. Towards these goals, we propose conclusive recommendations that, together with an explicit declaration of the workflow specification, would enhance the reproducibility of computational genomic analyses.
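
As an illustration of what an explicit declaration of a workflow step's requirements might look like, the following is a minimal sketch and not the authors' method. It records tool versions, parameters, input checksums and the execution environment for one step. The function names, the record layout and the example tool name are all assumptions made for this example.

```python
import hashlib
import json
import platform
import sys
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_provenance_record(step_name, tool_versions, parameters, input_paths):
    """Collect what a later run would need to know about one workflow step.

    The record layout is hypothetical; it is not a standard provenance format.
    """
    return {
        "step": step_name,
        # Versions are supplied by the caller; nothing is detected automatically.
        "tool_versions": dict(tool_versions),
        "parameters": dict(parameters),
        # Checksums let a later run confirm it is using identical input data.
        "inputs": {str(Path(p)): sha256_of(p) for p in input_paths},
        "environment": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
    }


def write_provenance(record, destination):
    """Serialise the record as sorted, indented JSON so that diffs stay readable."""
    Path(destination).write_text(json.dumps(record, indent=2, sort_keys=True))
```

A caller could invoke `build_provenance_record("align", {"hypothetical-aligner": "0.0.0"}, {"threads": 4}, ["reads.fastq"])` after each step and store the result next to the step's outputs; the tool name, version and file name here are placeholders.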

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d071/5508699/5fee2ca6afc1/12859_2017_1747_Fig1_HTML.jpg