Institute for Community Medicine, Department SHIP-KEF, University Medicine Greifswald, Greifswald, Germany.
Institute for Medical Informatics, Statistics, and Epidemiology, University of Leipzig, Leipzig, Germany.
BMC Med Res Methodol. 2021 Apr 2;21(1):63. doi: 10.1186/s12874-021-01252-7.
No standards exist for the handling and reporting of data quality in health research. This work introduces a data quality framework for observational health research data collections with supporting software implementations to facilitate harmonized data quality assessments.
Developments were guided by the evaluation of an existing data quality framework and literature reviews. Functions for the computation of data quality indicators were written in R. The concept and implementations are illustrated based on data from the population-based Study of Health in Pomerania (SHIP).
The data quality framework comprises 34 data quality indicators. These target four aspects of data quality: compliance with pre-specified structural and technical requirements (integrity); presence of data values (completeness); inadmissible or uncertain data values and contradictions (consistency); unexpected distributions and associations (accuracy). R functions calculate data quality metrics based on the provided study data and metadata and R Markdown reports are generated. Guidance on the concept and tools is available through a dedicated website.
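As a rough illustration of how indicators can be computed from study data together with metadata, the following minimal sketch derives one completeness measure (percentage of missing values) and one consistency measure (count of inadmissible values outside a pre-specified range). It is written in Python for illustration only and is not the authors' R implementation; the variable names, admissible ranges, missing-value codes, and the `quality_summary` function are all hypothetical.

```python
from math import isnan

# Hypothetical metadata: admissible range and missing-value codes per variable.
METADATA = {
    "sbp": {"min": 60, "max": 260, "missing_codes": {-99}},
    "age": {"min": 20, "max": 90, "missing_codes": {-99}},
}

# Hypothetical study data, one dict per participant.
STUDY_DATA = [
    {"sbp": 120, "age": 45},
    {"sbp": -99, "age": 61},
    {"sbp": 400, "age": 38},
    {"sbp": None, "age": 150},
]


def is_missing(value, missing_codes):
    """Treat None, NaN, and declared missing codes as missing."""
    if value is None:
        return True
    if isinstance(value, float) and isnan(value):
        return True
    return value in missing_codes


def quality_summary(rows, metadata):
    """Return per-variable completeness and consistency measures."""
    summary = {}
    for name, rules in metadata.items():
        values = [row.get(name) for row in rows]
        # Completeness: values that are present and not coded as missing.
        observed = [v for v in values if not is_missing(v, rules["missing_codes"])]
        # Consistency: observed values outside the admissible range.
        inadmissible = [v for v in observed if not rules["min"] <= v <= rules["max"]]
        n = len(values)
        summary[name] = {
            "n": n,
            "missing_pct": 100 * (n - len(observed)) / n if n else 0.0,
            "inadmissible_n": len(inadmissible),
        }
    return summary


if __name__ == "__main__":
    for variable, result in quality_summary(STUDY_DATA, METADATA).items():
        print(variable, result)
```

Keeping the rules in a metadata structure rather than in the checking code mirrors the idea described above: the same functions can be reused across variables and studies, and only the metadata changes.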
The presented data quality framework is the first of its kind for observational health research data collections that links a formal concept to implementations in R. The framework and tools facilitate harmonized data quality assessments in pursuit of transparent and reproducible research. Application scenarios include data quality monitoring while a study is under way and initial data analysis before substantive scientific analyses begin. The developments are also relevant beyond research.