Proposal for Using AI to Assess Clinical Data Integrity and Generate Metadata: Algorithm Development and Validation

Bönisch Caroline, Schmidt Christian, Kesztyüs Dorothea, Kestler Hans A, Kesztyüs Tibor

Department of Electrical Engineering and Informatics, University of Applied Sciences Stralsund, Zur Schwedenschanze 15, Stralsund, 18435, Germany, 49 3831 45 6505.

Medical Data Integration Center Göttingen, University Medical Center Göttingen, Göttingen, Germany.

JMIR Med Inform. 2025 Jun 30;13:e60204. doi: 10.2196/60204.

BACKGROUND

Evidence-based medicine combines scientific research, clinical expertise, and patient preferences to enhance patient outcomes and improve health care quality. Clinical data are crucial in aligning medical decisions with evidence-based practices, whether derived from systematic research or real-world data sources. Quality assurance of clinical data, mainly through predictive quality algorithms and machine learning, is essential to mitigate risks such as misdiagnosis, inappropriate treatment, bias, and compromised patient safety. Furthermore, high-quality clinical data are a prerequisite for replicating research results and for gaining insights from practice and real-world evidence.

OBJECTIVE

This study aims to demonstrate the varying quality of medical data in primary clinical source systems at a maximum care university hospital and provide researchers with insights into data reliability through predictive quality algorithms using machine learning techniques.

METHODS

A literature review was conducted to evaluate existing approaches to automated quality prediction. In addition, while care data were being integrated into a medical data integration center (MeDIC), metadata relevant to these clinical data were stored, considering factors such as data granularity and quality metrics. Completed patient cases with echocardiographic and laboratory findings as well as medication histories were selected from 2001 to 2023. Two authors manually reviewed the datasets and assigned a quality score to each entry, with 0 indicating unsatisfactory and 1 indicating satisfactory quality. Because quality control was treated as a binary problem, binary classifiers were used for quality prediction. Logistic regression, k-nearest neighbors, naive Bayes, decision tree, random forest, extreme gradient boosting (XGB), and support vector machine (SVM) classifiers were selected as machine learning algorithms. The datasets were preprocessed, the algorithms were trained on the echocardiographic, laboratory, and medication data, and the resulting prediction models were compared to identify the most effective algorithms for quality classification. The performance of the predictive quality algorithms was assessed based on accuracy, precision, recall, and scoring.
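As a minimal sketch of how the binary quality labels described above could be scored, the following computes accuracy, precision, and recall from manual reviewer labels and model predictions. The function name and the toy label lists are illustrative assumptions and are not taken from the study.

```python
def binary_quality_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for 0/1 quality labels (1 = satisfactory)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    total = len(pairs)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical example data: manual review scores vs. classifier output.
reviewer_labels = [1, 0, 1, 1, 0, 1, 0, 0]
model_labels = [1, 0, 0, 1, 1, 1, 0, 0]
metrics = binary_quality_metrics(reviewer_labels, model_labels)
print(metrics)
```

In practice a library implementation would usually be preferred; the explicit counts are shown here only to make the definitions visible.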

RESULTS

A total of 450 patient cases with complete information were extracted from the MeDIC data pool. The laboratory and medication datasets had to be limited to 4000 data entries each to enable manual review; the echocardiographic datasets comprised 750 examinations. XGB demonstrated the highest performance for the echocardiographic dataset, with an area under the receiver operating characteristic curve (AUC-ROC) of 84.6%. For laboratory data, SVM achieved an AUC-ROC of 89.8%, demonstrating superior discrimination performance. Finally, for the medication dataset, SVM showed the most balanced performance, achieving an AUC-ROC of 65.1%, the highest of all tested models.
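For readers unfamiliar with the AUC-ROC values reported above, the sketch below illustrates one standard way to compute the metric: the fraction of (satisfactory, unsatisfactory) pairs in which the satisfactory entry receives the higher model score, with ties counted as one half. The function name, labels, and scores are hypothetical and are not data from the study.

```python
def auc_roc(y_true, scores):
    """AUC-ROC via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one entry of each class")
    # A pair counts 1 if the positive outranks the negative, 0.5 on a tie.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical example: three satisfactory and two unsatisfactory entries.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2]
auc = auc_roc(labels, scores)
print(f"AUC-ROC: {auc:.3f}")
```

An AUC-ROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which puts the reported 65.1% for medication data and 89.8% for laboratory data in context.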

CONCLUSIONS

This proposal presents a template for predicting data quality and incorporating the resulting quality information into the metadata of a data integration center, a concept not previously implemented. The model was deployed for data inspection using a hybrid approach that combines the trained model with conventional inspection methods.
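The abstract does not specify how the hybrid inspection or the metadata record is structured. Purely as an illustrative sketch, the following assumes a conventional rule-based completeness check combined with a model score, with the result stored as a small metadata record. All names, fields, and the 0.5 threshold are hypothetical.

```python
from dataclasses import dataclass, asdict


@dataclass
class QualityMetadata:
    """Hypothetical quality record attached to one data entry."""
    record_id: str
    rule_check_passed: bool
    model_score: float
    quality_label: int  # 1 = satisfactory, 0 = unsatisfactory


def passes_rule_checks(record, required_fields):
    """Conventional check (assumed): every required field is present and non-empty."""
    return all(record.get(field) not in (None, "") for field in required_fields)


def annotate(record, model_score, required_fields, threshold=0.5):
    """Combine the rule check with a model score into one quality label."""
    rule_ok = passes_rule_checks(record, required_fields)
    label = 1 if rule_ok and model_score >= threshold else 0
    return QualityMetadata(record["id"], rule_ok, model_score, label)


# Hypothetical laboratory entry with a missing unit.
lab_entry = {"id": "lab-001", "analyte": "potassium", "value": 4.1, "unit": ""}
meta = annotate(lab_entry, model_score=0.92,
                required_fields=["analyte", "value", "unit"])
print(asdict(meta))
```

In this sketch the entry is labeled unsatisfactory despite a high model score because the rule check fails, which is one plausible way a conventional inspection could override a trained model.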



