RepeatFS: a file system providing reproducibility through provenance and automation.

Affiliations

Department of Computer Science.

Hubbard Center for Genome Studies.

Publication information

Bioinformatics. 2021 Jun 9;37(9):1292-1296. doi: 10.1093/bioinformatics/btaa950.

DOI: 10.1093/bioinformatics/btaa950

PMID: 33230554

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8189677/

Abstract

MOTIVATION

Reproducibility is of central importance to the scientific process. The difficulty of consistently replicating and verifying experimental results is magnified in the era of big data, in which bioinformatics analysis often involves complex multi-application pipelines operating on terabytes of data. These processes result in thousands of possible permutations of data preparation steps, software versions and command-line arguments. Existing reproducibility frameworks are cumbersome and involve redesigning computational methods. To address these issues, we developed RepeatFS, a file system that records, replicates and verifies informatics workflows with no alteration to the original methods. RepeatFS also provides several other features to help promote analytical transparency and reproducibility, including provenance visualization and task automation.

RESULTS

We used RepeatFS to successfully visualize and replicate a variety of bioinformatics tasks consisting of over a million operations with no alteration to the original methods. RepeatFS correctly identified all software inconsistencies that resulted in replication differences.
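The verification idea described above, comparing an original run with its replication and flagging software inconsistencies, can be illustrated with a small Python sketch. This is not RepeatFS's implementation or API: the record layout and the helper names (`file_digest`, `make_record`, `compare_records`) are hypothetical, and the tool name, version strings and file contents are made up for the demonstration.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def make_record(command, tool_versions, outputs):
    """Build a hypothetical provenance record for one workflow step."""
    return {
        "command": command,
        "tool_versions": dict(tool_versions),
        "output_digests": {Path(p).name: file_digest(p) for p in outputs},
    }


def compare_records(original, replica):
    """List human-readable differences between two provenance records."""
    diffs = []
    if original["command"] != replica["command"]:
        diffs.append("command differs")
    for tool, version in original["tool_versions"].items():
        other = replica["tool_versions"].get(tool)
        if other != version:
            diffs.append(f"{tool}: {version} -> {other}")
    for name, digest in original["output_digests"].items():
        if replica["output_digests"].get(name) != digest:
            diffs.append(f"output {name} differs")
    return diffs


# Toy demonstration: same command, but a different (made-up) tool version
# and a different output file in the replicated run.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "run1"
    second = Path(tmp) / "run2"
    first.mkdir()
    second.mkdir()
    (first / "result.txt").write_text("ACGT\n")
    (second / "result.txt").write_text("ACGA\n")
    original = make_record("trim reads.fq", {"trimmer": "1.0"}, [first / "result.txt"])
    replica = make_record("trim reads.fq", {"trimmer": "1.1"}, [second / "result.txt"])
    differences = compare_records(original, replica)

print(differences)
```

In this sketch, a replication counts as verified when `compare_records` returns an empty list; a version mismatch reported alongside a differing output points to the likely cause of the discrepancy.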

AVAILABILITY AND IMPLEMENTATION

RepeatFS is implemented in Python 3. Its source code and documentation are available at https://github.com/ToniWestbrook/repeatfs.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
