Medical Informatics Group, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany.
Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, Berlin, Germany.
J Med Internet Res. 2023 Mar 27;25:e42289. doi: 10.2196/42289.
Data provenance refers to the origin, processing, and movement of data. Reliable and precise knowledge about data provenance has great potential to improve reproducibility as well as quality in biomedical research and, therefore, to foster good scientific practice. However, despite the increasing interest in data provenance technologies in the literature and their implementation in other disciplines, these technologies have not yet been widely adopted in biomedical research.
The aim of this scoping review was to provide a structured overview of the body of knowledge on provenance methods in biomedical research by systematizing articles covering data provenance technologies developed for or used in this application area; describing and comparing the functionalities as well as the design of the provenance technologies used; and identifying gaps in the literature, which could provide opportunities for future research on technologies that could receive more widespread adoption.
Following a methodological framework for scoping studies and the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) guidelines, articles were identified by searching the PubMed, IEEE Xplore, and Web of Science databases and subsequently screened for eligibility. We included original articles covering software-based provenance management for scientific research published between 2010 and 2021. A set of data items was defined along the following five axes: publication metadata, application scope, provenance aspects covered, data representation, and functionalities. The data items were extracted from the articles, stored in a charting spreadsheet, and summarized in tables and figures.
We identified 44 original articles published between 2010 and 2021. We found that the solutions described were heterogeneous along all axes. We also identified relationships among motivations for the use of provenance information, feature sets (capture, storage, retrieval, visualization, and analysis), and implementation details such as the data models and technologies used. An important gap that we identified is that only a few publications address the analysis of provenance data or use established provenance standards, such as PROV.
The heterogeneity of provenance methods, models, and implementations found in the literature points to the lack of a unified understanding of provenance concepts for biomedical data. Providing a common framework, a biomedical reference, and benchmarking data sets could foster the development of more comprehensive provenance solutions.