Department of Computer Science.
Hubbard Center for Genome Studies.
Bioinformatics. 2021 Jun 9;37(9):1292-1296. doi: 10.1093/bioinformatics/btaa950.
Reproducibility is of central importance to the scientific process. The difficulty of consistently replicating and verifying experimental results is magnified in the era of big data, in which bioinformatics analysis often involves complex multi-application pipelines operating on terabytes of data. These pipelines give rise to thousands of possible permutations of data preparation steps, software versions and command-line arguments. Existing reproducibility frameworks are cumbersome and require computational methods to be redesigned. To address these issues, we developed RepeatFS, a file system that records, replicates and verifies informatics workflows with no alteration to the original methods. RepeatFS also provides several other features that promote analytical transparency and reproducibility, including provenance visualization and task automation.
We used RepeatFS to successfully visualize and replicate a variety of bioinformatics tasks consisting of over a million operations with no alteration to the original methods. RepeatFS correctly identified all software inconsistencies that resulted in replication differences.
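To make the idea of verifying a replication concrete, the following is a minimal illustrative sketch in Python. It does not use or reproduce the RepeatFS API: the `StepRecord` type, its fields and the `compare_runs` helper are hypothetical names introduced here, and the tool names and inputs are made up. It assumes that each pipeline step can be summarized by the tool name, software version, command-line arguments and a checksum of its output, and it reports where an original run and a replica disagree.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class StepRecord:
    """Hypothetical provenance record for one pipeline step (not a RepeatFS type)."""
    tool: str
    version: str
    args: tuple
    output_sha256: str


def sha256_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest of a step's output bytes."""
    return hashlib.sha256(data).hexdigest()


def compare_runs(original, replica):
    """List the differences between two runs, compared step by step."""
    differences = []
    if len(original) != len(replica):
        differences.append(
            f"step count differs: {len(original)} vs {len(replica)}"
        )
    # zip stops at the shorter run; the length check above flags the rest.
    for index, (a, b) in enumerate(zip(original, replica)):
        for name in ("tool", "version", "args", "output_sha256"):
            left, right = getattr(a, name), getattr(b, name)
            if left != right:
                differences.append(
                    f"step {index}: {name} differs: {left!r} vs {right!r}"
                )
    return differences


# Made-up example: the replica used a newer version and produced different output.
original = [StepRecord("aligner", "1.0", ("-k", "21"), sha256_bytes(b"ACGT"))]
replica = [StepRecord("aligner", "1.1", ("-k", "21"), sha256_bytes(b"ACGA"))]

for line in compare_runs(original, replica):
    print(line)
```

In this sketch a version mismatch and a checksum mismatch at the same step are reported together, which is the kind of pairing that lets a software inconsistency be linked to a replication difference.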
RepeatFS is implemented in Python 3. Its source code and documentation are available at https://github.com/ToniWestbrook/repeatfs.
Supplementary data are available at Bioinformatics online.