Machine learning-based strategies for improving healthcare data quality: an evaluation of accuracy, completeness, and reusability.

Author information

Jarmakovica Agate

Affiliation

Faculty of Computer Science, Information Technology and Energy, Riga Technical University, Riga, Latvia.

Publication information

Front Artif Intell. 2025 Jul 21;8:1621514. doi: 10.3389/frai.2025.1621514. eCollection 2025.

DOI:10.3389/frai.2025.1621514

PMID:40761812

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12319021/

Abstract

Healthcare data quality is a critical factor in clinical decision-making, diagnostic accuracy, and the overall efficacy of healthcare systems. This study addresses key challenges such as missing values and anomalies in healthcare datasets, which can result in misdiagnoses and inefficient resource use. The objective is to develop and evaluate a machine learning-based strategy to improve healthcare data quality, with a focus on three core dimensions: accuracy, completeness, and reusability. A publicly available diabetes dataset comprising 768 records and 9 variables was used. The methodology involved a comprehensive data preprocessing workflow, including data acquisition, cleaning, and exploratory analysis using established Python tools. Missing values were addressed using K-nearest neighbors imputation, while anomaly detection was performed using ensemble techniques. Principal Component Analysis (PCA) and correlation analysis were applied to identify key predictors of diabetes, such as Glucose, BMI, and Age. The results showed significant improvements in data completeness (from 90.57% to nearly 100%), better accuracy by mitigating anomalies, and enhanced reusability for downstream machine learning tasks. In predictive modeling, Random Forest outperformed LightGBM, achieving an accuracy of 75.3% and an AUC of 0.83. The process was fully documented, and reproducibility tools were integrated to ensure the methodology could be replicated and extended. These findings demonstrate the potential of machine learning to support robust data quality improvement frameworks in healthcare, ultimately contributing to better clinical outcomes and predictive capabilities.
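
To make the completeness and imputation steps described in the abstract more concrete, here is a minimal sketch in plain Python. It is not the study's code: the records are synthetic toy values, the helper names (`completeness`, `knn_impute`) are hypothetical, and the distance rule (Euclidean over shared columns, rescaled to the full column count) is an assumption chosen for illustration. In practice one would more likely use an established library implementation such as scikit-learn's `KNNImputer`, and standardize features before computing distances.

```python
import math


def completeness(rows):
    """Percentage of cells that are not missing (missing cells are None)."""
    total = sum(len(r) for r in rows)
    present = sum(v is not None for r in rows for v in r)
    return 100.0 * present / total


def _distance(a, b):
    """Euclidean distance over columns observed in both rows, rescaled to
    the full column count; infinite if the rows share no observed column."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    sq = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(sq * len(a) / len(shared))


def knn_impute(rows, k=2):
    """Fill each missing cell with the mean of that column over the k
    nearest rows in which the column is observed (assumed, simplified rule)."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows that have column j observed.
            donors = sorted(
                (_distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [(d, v) for d, v in donors if d != math.inf][:k]
            if nearest:
                filled[i][j] = sum(v for _, v in nearest) / len(nearest)
    return filled


# Synthetic toy records (glucose, BMI, age); NOT taken from the study's dataset.
records = [
    [100.0, 25.0, 30.0],
    [110.0, None, 35.0],
    [150.0, 35.0, 50.0],
    [105.0, 26.0, None],
    [None, 34.0, 52.0],
]

before = completeness(records)
filled = knn_impute(records, k=2)
after = completeness(filled)
print(f"completeness before: {before:.2f}%, after: {after:.2f}%")
```

The same completeness measure can be applied before and after any imputation method, which is how an improvement such as the one reported in the abstract could be quantified. Anomaly detection, PCA, and classifier evaluation are separate steps that this sketch does not cover.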

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2cb5/12319021/eba2682041c9/frai-08-1621514-g001.jpg
