Investigating reproducibility and tracking provenance - A genomic workflow case study.

Author information

Kanwal Sehrish, Khan Farah Zaib, Lonie Andrew, Sinnott Richard O

Affiliations

Department of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, 3010, Australia.

Melbourne Bioinformatics, The University of Melbourne, Melbourne, VIC, 3010, Australia.

Publication information

BMC Bioinformatics. 2017 Jul 12;18(1):337. doi: 10.1186/s12859-017-1747-0.

DOI:10.1186/s12859-017-1747-0

PMID:28701218

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC5508699/

Abstract

BACKGROUND

Computational bioinformatics workflows are extensively used to analyse genomics data, and several approaches are available to support their implementation and execution. Reproducibility is a core principle of any scientific workflow, yet it remains a challenge that has not been fully addressed, owing to an incomplete understanding of reproducibility requirements and of the assumptions made by workflow definition approaches. Provenance information should be tracked and used to capture all of these requirements, supporting reuse of existing workflows.

RESULTS

We implemented a complex but widely deployed bioinformatics workflow using three representative approaches to workflow definition and execution. Through this implementation, we identified assumptions implicit in these approaches that ultimately produce insufficient documentation of workflow requirements, resulting in failed execution of the workflow. This study proposes a set of recommendations that aim to mitigate these assumptions and guide the scientific community towards reproducible science, thereby addressing the reproducibility crisis.

CONCLUSIONS

Reproducing, adapting or even repeating a bioinformatics workflow in any environment requires substantial technical knowledge of the workflow execution environment, resolution of analysis assumptions and rigorous compliance with reproducibility requirements. Towards these goals, we propose conclusive recommendations that, along with an explicit declaration of the workflow specification, would result in enhanced reproducibility of computational genomic analyses.
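
As an illustration of what an explicit, machine-readable record of a workflow step might look like, the following is a minimal Python sketch. It is not taken from the paper: the function names `sha256_of` and `provenance_record`, the record fields, and the tool name and version in the usage note are all hypothetical, chosen only to show the kind of information (tool version, parameters, input checksums, execution environment) that provenance tracking would need to capture.

```python
import hashlib
import json
import platform
import sys


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_record(step, tool, tool_version, parameters, inputs):
    """Build a JSON-serialisable record for one workflow step.

    All field names are illustrative assumptions, not a standard schema.
    """
    return {
        "step": step,
        "tool": tool,
        "tool_version": tool_version,
        "parameters": parameters,
        # Checksums let a later run confirm it used identical input data.
        "inputs": {path: sha256_of(path) for path in inputs},
        # Minimal description of the execution environment.
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
    }


def save_record(record, path):
    """Write the record as sorted, indented JSON so diffs stay readable."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2, sort_keys=True)
```

A call such as `provenance_record("align", "example-aligner", "0.0.0", {"threads": 4}, ["reads.fastq"])` (placeholder tool name, version and file) would return a dictionary that can be stored next to the step's outputs with `save_record`. Recording checksums rather than file names alone is a deliberate choice here: a file name can stay the same while its contents change between runs.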
