Jarmakovica Agate
Faculty of Computer Science, Information Technology and Energy, Riga Technical University, Riga, Latvia.
Front Artif Intell. 2025 Jul 21;8:1621514. doi: 10.3389/frai.2025.1621514. eCollection 2025.
Healthcare data quality is a critical factor in clinical decision-making, diagnostic accuracy, and the overall efficacy of healthcare systems. This study addresses key challenges such as missing values and anomalies in healthcare datasets, which can result in misdiagnoses and inefficient resource use. The objective is to develop and evaluate a machine learning-based strategy to improve healthcare data quality, with a focus on three core dimensions: accuracy, completeness, and reusability. A publicly available diabetes dataset comprising 768 records and 9 variables was used. The methodology involved a comprehensive data preprocessing workflow, including data acquisition, cleaning, and exploratory analysis using established Python tools. Missing values were addressed using K-nearest neighbors imputation, while anomaly detection was performed using ensemble techniques. Principal Component Analysis (PCA) and correlation analysis were applied to identify key predictors of diabetes, such as Glucose, BMI, and Age. The results showed significant improvements in data completeness (from 90.57% to nearly 100%), better accuracy by mitigating anomalies, and enhanced reusability for downstream machine learning tasks. In predictive modeling, Random Forest outperformed LightGBM, achieving an accuracy of 75.3% and an AUC of 0.83. The process was fully documented, and reproducibility tools were integrated to ensure the methodology could be replicated and extended. These findings demonstrate the potential of machine learning to support robust data quality improvement frameworks in healthcare, ultimately contributing to better clinical outcomes and predictive capabilities.
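The completeness measure and the K-nearest neighbors imputation step mentioned above can be illustrated with a minimal sketch. This is not the authors' code. It assumes that completeness is the percentage of non-missing cells in the table, that missing values are marked with `None`, and that distances are computed only over the features two records both have observed. The function names (`completeness`, `knn_impute`), the choice of `k=2`, and the toy records are hypothetical; the column meanings (Glucose, BMI, Age) are borrowed from the abstract for illustration only.

```python
import math


def completeness(rows):
    """Percentage of cells that are not missing (None marks a missing cell)."""
    total = sum(len(r) for r in rows)
    present = sum(v is not None for r in rows for v in r)
    return 100.0 * present / total


def _distance(a, b):
    """Root mean squared difference over features observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf  # no common features, so the rows cannot be compared
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(rows, k=2):
    """Fill each missing cell with the column mean of the k nearest rows."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidates: other rows in which column j is observed,
            # ordered by distance computed on the original (unfilled) data.
            candidates = sorted(
                (_distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [v for d, v in candidates[:k] if d < math.inf]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Hypothetical toy records: (Glucose, BMI, Age); None marks a missing value.
rows = [
    [148.0, 33.6, 50.0],
    [85.0, 26.6, 31.0],
    [None, 23.3, 32.0],
    [89.0, None, 21.0],
    [137.0, 43.1, 33.0],
]

before = completeness(rows)
filled = knn_impute(rows, k=2)
after = completeness(filled)
print(f"completeness before: {before:.2f}%, after: {after:.2f}%")
```

The sketch leaves features unscaled, so columns with larger numeric ranges dominate the distance; standardizing features first is a common refinement. In practice a library implementation such as scikit-learn's `KNNImputer` would normally be used instead of hand-written code.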