Department of Biomedical Informatics, Columbia University, New York, New York, USA.
Department of Artificial Intelligence and Human Health, Icahn School of Medicine, New York, New York, USA.
J Am Med Inform Assoc. 2022 Nov 14;29(12):2032-2040. doi: 10.1093/jamia/ocac166.
To design and evaluate an interactive data quality (DQ) characterization tool focused on fitness-for-use completeness measures to support researchers' assessment of a dataset.
Design requirements were identified through a conceptual framework on DQ, literature review, and interviews. The prototype of the tool was developed based on the requirements gathered and was further refined by domain experts. The Fitness-for-Use Tool was evaluated through a within-subjects controlled experiment comparing it with a baseline tool that provides information on missing data based on intrinsic DQ measures. The tools were evaluated on task performance and perceived usability.
The Fitness-for-Use Tool allows users to define data completeness by customizing the measures and their thresholds to fit their research task, and it provides a data summary based on the customized definition. Using the Fitness-for-Use Tool, study participants were able to accurately complete the fitness-for-use assessment in less time than when using the Intrinsic DQ Tool. The study participants perceived the Fitness-for-Use Tool as more useful than the Intrinsic DQ Tool for determining the fitness-for-use of a dataset.
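As a rough illustration of what a customizable completeness definition could look like, the following Python sketch counts a patient record as fit for use only if every user-chosen variable has at least a user-chosen number of recorded values. This is not the tool's implementation: the names `CompletenessRule`, `is_fit`, and `summarize`, the variable names, and the sample values are all hypothetical, and the abstract does not specify which measures the tool actually offers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompletenessRule:
    """One user-defined completeness requirement (hypothetical structure)."""
    variable: str   # assumed variable name, e.g. "hba1c"
    min_count: int  # assumed threshold: minimum recorded values per patient


def is_fit(record, rules):
    """Return True if a patient record satisfies every rule."""
    for rule in rules:
        # Count only values that were actually recorded (not None).
        recorded = [v for v in record.get(rule.variable, []) if v is not None]
        if len(recorded) < rule.min_count:
            return False
    return True


def summarize(records, rules):
    """Summarize how many patients meet the customized definition."""
    total = len(records)
    fit = sum(1 for record in records.values() if is_fit(record, rules))
    return {
        "patients": total,
        "fit": fit,
        "fraction_fit": fit / total if total else 0.0,
    }


# Made-up example data: patient id -> variable -> list of measurements.
records = {
    "p1": {"hba1c": [6.1, 6.4], "bmi": [27.0]},
    "p2": {"hba1c": [7.0, None], "bmi": [31.2]},
    "p3": {"hba1c": [5.9, 6.0, 6.2], "bmi": []},
}

# A research task that needs two HbA1c values and one BMI value per patient.
rules = [CompletenessRule("hba1c", 2), CompletenessRule("bmi", 1)]

print(summarize(records, rules))
```

Changing the rules changes the summary for the same data, which is the contrast with an intrinsic measure such as a fixed percentage of missing values per column.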
Incorporating fitness-for-use measures in a DQ characterization tool could provide a data summary that meets researchers' needs. The design features identified in this study have the potential to be applied to other biomedical data types.
A tool that summarizes a dataset in terms of fitness-for-use dimensions and measures specific to a research question supports dataset assessment better than a tool that only presents information on intrinsic DQ measures.