Peter L. Reichertz Institute for Medical Informatics of TU Braunschweig and Hannover Medical School, Carl-Neuberg-Str. 1, 30625, Hannover, Germany.
BMC Med Inform Decis Mak. 2021 Nov 1;21(1):302. doi: 10.1186/s12911-021-01656-x.
Data quality assessment is important but complex and task-dependent. Identifying suitable measurement methods, and reference ranges for assessing their results, is challenging. Both manual inspection of measurement results and current data-driven approaches for learning which results indicate data quality issues have considerable limitations, e.g., in identifying task-dependent thresholds for measurement results that indicate data quality issues.
To explore the applicability and potential benefits of a data-driven approach for learning task-dependent knowledge about suitable measurement methods and the assessment of their results. Such knowledge could help others determine whether a local data stock is suitable for a given task.
We started by creating artificial data with previously defined data quality issues and applied a set of generic measurement methods to this data (e.g., counting the number of values in a certain variable or computing their mean). We trained decision trees on the exported measurement methods' results and the corresponding outcome data (data indicating the data's suitability for a use case). For evaluation, we derived rules for potential measurement methods and reference values from the decision trees and compared these regarding their coverage of the true data quality issues artificially created in the dataset. Three researchers independently derived these rules: one with knowledge of the data quality issues present and two without.
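The core idea can be sketched as follows: label each dataset's measurement results with a suitability outcome, fit a decision tree, and read candidate rules (measurement method plus threshold) off its splits. This is a minimal illustration, not the authors' implementation; the library choice (scikit-learn), the two measurement methods, and all thresholds are assumptions for illustration.

```python
# Minimal sketch: learning data quality rules from labeled measurement
# results with a decision tree. Measurement methods, thresholds, and the
# use of scikit-learn are illustrative assumptions, not the paper's code.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 200  # number of simulated data stocks

# Results of two generic measurement methods per data stock:
# a value count (completeness) and the mean value of a variable.
value_count = rng.integers(50, 1000, size=n)
mean_value = rng.normal(70, 15, size=n)

# Artificial ground truth mimicking predefined data quality issues:
# a stock is suitable only if sufficiently complete and the mean is
# within a plausible range for the task.
suitable = (value_count >= 300) & (np.abs(mean_value - 70) <= 20)

# Train a shallow tree on the measurement results and outcome labels.
X = np.column_stack([value_count, mean_value])
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, suitable)

# The tree's split conditions suggest which measurement methods matter
# and which reference values separate suitable from unsuitable data;
# a reviewer can turn these into explicit data quality rules.
print(export_text(tree, feature_names=["value_count", "mean_value"]))
```

Each path from root to a "suitable" leaf reads as a conjunction of candidate rules (e.g., a minimum value count combined with a mean-value range), which is the form of knowledge the evaluation then compares against the artificially created issues.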
Our self-trained decision trees indicated rules for 12 of the 19 previously defined data quality issues. The learned knowledge about measurement methods and their assessment was complementary to manual interpretation of the measurement methods' results.
Our data-driven approach derives sensible knowledge for task-dependent data quality assessment and complements other current approaches. Using labeled measurement methods' results as training data, our approach successfully suggested applicable rules for checking the data quality characteristics that determine whether a dataset is suitable for a given task.