Suppr超能文献

在存在大量缺失标签的电子健康记录中估计糖尿病视网膜病变的患病率。

Estimating the prevalence of diabetic retinopathy in electronic health records with massive missing labels.

作者信息

Liang Ye, Wang Ru, Wang Yuchen, Liu Tieming

机构信息

Department of Statistics, Oklahoma State University, Stillwater, OK, USA.

Dell Technologies, Round Rock, TX, USA.

出版信息

Intell Based Med. 2024;10. doi: 10.1016/j.ibmed.2024.100154. Epub 2024 Jul 5.

Abstract

OBJECTIVE

The paper aims to address the problem of massive unlabeled patients in electronic health records (EHR) who potentially have undiagnosed diabetic retinopathy (DR). It is desired to estimate the actual DR prevalence in EHR with 96 % missing labels.

MATERIALS AND METHODS

The Cerner Health Facts data are used in the study, with 3749 labeled DR patients and 97,876 unlabeled diabetic patients. This extensive dataset spans the demographics of the United States over the past two decades. We implemented state-of-art positive-unlabeled learning methods, including ensemble-based support vector machine, ensemble-based random forest, and Bayesian finite mixture modeling.

RESULTS

The estimated DR prevalence in the population represented by Cerner EHR is approximately 25 % and the classification techniques generally achieve an AUC of around 87 %. As a by-product, a predictive inference on the risk of DR based on a patient's personalized medical information is derived.

DISCUSSION

Missing labels is a common issue for EHR data quality. Ignoring these missing labels can lead to biased results in the analyses of EHR data. The problem is especially severe in the context of DR. It is thus important to use machine learning or statistical tools to identify the unlabeled patients. The tool in this paper helps both data analysts and clinicians in their practices.

摘要

目的

本文旨在解决电子健康记录(EHR)中大量未标记患者的问题,这些患者可能患有未确诊的糖尿病视网膜病变(DR)。期望在96%标签缺失的情况下估计EHR中DR的实际患病率。

材料与方法

本研究使用了Cerner健康事实数据,其中有3749名标记了DR的患者和97876名未标记的糖尿病患者。这个庞大的数据集涵盖了过去二十年美国的人口统计数据。我们实施了先进的正无标记学习方法,包括基于集成的支持向量机、基于集成的随机森林和贝叶斯有限混合模型。

结果

Cerner EHR所代表的人群中估计的DR患病率约为25%,分类技术通常实现的曲线下面积(AUC)约为87%。作为副产品,基于患者的个性化医疗信息得出了对DR风险的预测推断。

讨论

标签缺失是EHR数据质量的常见问题。在EHR数据分析中忽略这些缺失标签会导致有偏差的结果。在DR的背景下,这个问题尤其严重。因此,使用机器学习或统计工具来识别未标记患者很重要。本文中的工具对数据分析师和临床医生的实践都有帮助。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0c6e/11666125/7698d0835881/nihms-2042093-f0001.jpg

相似文献

4
Weakly Semi-supervised phenotyping using Electronic Health records.基于电子健康记录的弱监督表型研究
J Biomed Inform. 2022 Oct;134:104175. doi: 10.1016/j.jbi.2022.104175. Epub 2022 Sep 5.

本文引用的文献

2
Improving Access to Eye Care: A Systematic Review of the Literature.改善眼科保健服务可及性的系统评价文献研究。
Ophthalmology. 2022 Oct;129(10):e114-e126. doi: 10.1016/j.ophtha.2022.07.012. Epub 2022 Sep 1.

文献AI研究员

20分钟写一篇综述,助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型,支持多种主流文档格式。

立即体验