Proposal for Using AI to Assess Clinical Data Integrity and Generate Metadata: Algorithm Development and Validation.

Authors

Bönisch Caroline, Schmidt Christian, Kesztyüs Dorothea, Kestler Hans A, Kesztyüs Tibor

Affiliations

Department of Electrical Engineering and Informatics, University of Applied Sciences Stralsund, Zur Schwedenschanze 15, Stralsund, 18435, Germany, 49 3831 45 6505.

Medical Data Integration Center Göttingen, University Medical Center Göttingen, Göttingen, Germany.

Publication information

JMIR Med Inform. 2025 Jun 30;13:e60204. doi: 10.2196/60204.

Abstract

BACKGROUND

Evidence-based medicine combines scientific research, clinical expertise, and patient preferences to enhance patient outcomes and improve health care quality. Clinical data, whether derived from systematic research or real-world data sources, are crucial in aligning medical decisions with evidence-based practices. Quality assurance of clinical data, mainly through predictive quality algorithms and machine learning, is essential to mitigate risks such as misdiagnosis, inappropriate treatment, bias, and compromised patient safety. Furthermore, high-quality clinical data are a prerequisite for replicating research results and for gaining insights from practice and real-world evidence.

OBJECTIVE

This study aims to demonstrate the varying quality of medical data in the primary clinical source systems of a maximum-care university hospital and to give researchers insight into data reliability through predictive quality algorithms based on machine learning techniques.

METHODS

A literature review was conducted to evaluate existing approaches to automated quality prediction. In addition, as part of integrating care data into a medical data integration center (MeDIC), metadata relevant to these clinical data were stored, considering factors such as data granularity and quality metrics. Completed patient cases with echocardiographic and laboratory findings as well as medication histories were selected from 2001 to 2023. Two authors manually reviewed the datasets and assigned a quality score to each entry, with 0 indicating unsatisfactory and 1 indicating satisfactory quality. Since quality control was treated as a binary problem, binary classifiers were used for the quality prediction. Logistic regression, k-nearest neighbors, a naive Bayes classifier, a decision tree classifier, a random forest classifier, extreme gradient boosting (XGB), and support vector machines (SVM) were selected as machine learning algorithms. The dataset was preprocessed, the algorithms were trained on echocardiographic, laboratory, and medication data, and the resulting prediction models were compared to identify the most effective algorithms for quality classification. The performance of the predictive quality algorithms was assessed based on accuracy, precision, recall, and scoring.
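As an illustration of how a binary quality prediction can be compared against the manual review, the following minimal sketch computes accuracy, precision, and recall from two lists of 0/1 labels. It is not the authors' implementation: the function name `binary_metrics` and the example labels are hypothetical, and the sketch uses only the Python standard library.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, and recall for 0/1 quality labels.

    1 = satisfactory quality, 0 = unsatisfactory quality.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels for illustration only; they are not taken from the study.
reviewer_labels = [1, 1, 0, 1, 0, 0, 1, 1]  # manual quality scores
model_labels = [1, 0, 0, 1, 1, 0, 1, 1]     # classifier predictions
metrics = binary_metrics(reviewer_labels, model_labels)
print(metrics)
```

In this toy example, precision is the share of entries flagged as satisfactory by the model that the reviewers also rated satisfactory, and recall is the share of reviewer-approved entries that the model recovered.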

RESULTS

There were 450 patient cases with complete information extracted from the MeDIC data pool. The laboratory and medication datasets had to be limited to 4000 data entries each to enable manual review; the echocardiographic datasets comprised 750 examinations. XGB demonstrated the highest performance for the echocardiographic dataset with an area under the receiver operating characteristic curve (AUC-ROC) of 84.6%. For laboratory data, SVM achieved an AUC-ROC score of 89.8%, demonstrating superior discrimination performance. Finally, regarding the medication dataset, SVM showed the most balanced performance, achieving an AUC-ROC of 65.1%, the highest of all tested models.
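For readers unfamiliar with the AUC-ROC values reported above, the sketch below shows one standard way to compute the metric: as the probability that a randomly chosen satisfactory entry receives a higher model score than a randomly chosen unsatisfactory one, with ties counted as one half. The function name `auc_roc` and the example data are hypothetical; the abstract does not state how the authors computed the metric.

```python
def auc_roc(y_true, scores):
    """Rank-based AUC-ROC for 0/1 labels and continuous model scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one entry of each class")

    # Compare every positive score with every negative score.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Made-up labels and scores for illustration only.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.2, 0.5]
auc = auc_roc(labels, scores)
print(f"AUC-ROC: {auc:.3f}")
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which puts the 65.1% reported for the medication dataset much closer to chance than the 84.6% and 89.8% reported for the other two datasets.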

CONCLUSIONS

This proposal presents a template for predicting data quality and incorporating the resulting quality information into the metadata of a data integration center, a concept not previously implemented. The model was deployed for data inspection using a hybrid approach that combines the trained model with conventional inspection methods.
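The hybrid idea can be pictured with a small sketch that combines a conventional rule-based check with a model score and writes the result into a metadata record. Everything here is an assumption made for illustration: the abstract does not describe the metadata schema or how the two checks are combined, so the field names, the range rule, the 0.5 threshold, and the AND combination are all hypothetical.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class QualityMetadata:
    """Hypothetical metadata record attached to one data entry."""
    entry_id: str
    rule_check_passed: bool
    model_score: float
    quality_label: int  # 1 = satisfactory, 0 = unsatisfactory


def conventional_check(entry):
    """Hypothetical rule: the value must be present and within a plausible range."""
    value = entry.get("value")
    return value is not None and entry["low"] <= value <= entry["high"]


def annotate(entry, model_score, threshold=0.5):
    """Combine the rule check and the model score (assumed AND combination)."""
    passed = conventional_check(entry)
    label = 1 if passed and model_score >= threshold else 0
    return QualityMetadata(entry["id"], passed, model_score, label)


# Made-up entry and model score for illustration only.
entry = {"id": "lab-001", "value": 4.2, "low": 0.0, "high": 20.0}
record = annotate(entry, model_score=0.83)
print(json.dumps(asdict(record)))
```

Storing both the rule outcome and the raw model score, rather than only the final label, would let downstream researchers apply their own threshold when judging data reliability.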


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c411/12234397/c7e1d89a627e/medinform-v13-e60204-g001.jpg
