Samuel Sheeba, König-Ries Birgitta
Michael Stifel Center Jena, Jena, Germany.
Heinz Nixdorf Chair for Distributed Information Systems, Friedrich-Schiller Universität Jena, Jena, Thuringia, Germany.
PeerJ Comput Sci. 2022 Mar 10;8:e921. doi: 10.7717/peerj-cs.921. eCollection 2022.
Scientific data management plays a key role in the reproducibility of scientific results. To reproduce results, not only the results but also the data and steps of scientific experiments must be made findable, accessible, interoperable, and reusable. Tracking, managing, describing, and visualizing provenance helps the scientific community understand, reproduce, and reuse experiments. Current systems lack a link between the data, steps, and results from the computational and non-computational processes of an experiment. Such a link, however, is vital for the reproducibility of results. We present a novel solution for the end-to-end provenance management of scientific experiments. We provide a framework, CAESAR (CollAborative Environment for Scientific Analysis with Reproducibility), which allows scientists to capture, manage, query, and visualize the complete path of a scientific experiment, consisting of computational and non-computational data and steps, in an interoperable way. CAESAR integrates the REPRODUCE-ME provenance model, which extends existing semantic web standards, to represent the whole picture of an experiment by describing the path it took from its design to its result. ProvBook, an extension for Jupyter Notebooks, was developed and integrated into CAESAR to support computational reproducibility. We have applied our contributions to, and evaluated them on, a set of scientific experiments in microscopy research projects.