Mining for equitable health: Assessing the impact of missing data in electronic health records.

Author information

Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, PA, United States.

Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA, United States.

Publication information

J Biomed Inform. 2023 Mar;139:104269. doi: 10.1016/j.jbi.2022.104269. Epub 2023 Jan 5.

DOI:10.1016/j.jbi.2022.104269

PMID:36621750

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10391553/

Abstract

Electronic health records (EHR) are collected as a routine part of healthcare delivery and have great potential to improve patient health outcomes. They contain multiple years of health information that can be leveraged for risk prediction, disease detection, and treatment evaluation. However, they lack a consistent, standardized format across institutions, particularly in the United States, and can present significant analytical challenges: they contain multi-scale data from heterogeneous domains and include both structured and unstructured data. Data for individual patients are collected at irregular time intervals and with varying frequencies. In addition to the analytical challenges, EHR can reflect inequity: patients belonging to different groups will have differing amounts of data in their health records. Many of these issues can contribute to biased data collection. As a consequence, the data for under-served groups may be less informative, partly due to more fragmented care, which can be viewed as a type of missing data problem. For EHR data in this complex form, there is currently no framework for introducing realistic missing values. There has also been little to no work assessing the impact of missing data in EHR. In this work, we first introduce a terminology to define three levels of EHR data and then propose a novel framework for simulating realistic missing data scenarios in EHR to adequately assess their impact on predictive modeling. We incorporate a medical knowledge graph to capture dependencies between medical events and thereby create a more realistic missing data framework. In an intensive care unit setting, we found that missing data have a greater negative impact on the performance of disease prediction models in groups that tend to have less access to healthcare or seek less healthcare. We also found that the impact of missing data on disease prediction models is stronger when the knowledge graph framework is used to introduce realistic missing values than when events are removed at random.
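
The contrast the abstract draws between random event removal and knowledge-graph-guided removal can be illustrated with a toy sketch. The abstract does not specify the removal procedure, so everything below is an assumption made for illustration: the function names, the dictionary-based graph, the event codes, and the rule that removing one event also removes every event reachable from it. It is not the authors' implementation.

```python
import random
from collections import deque


def remove_random(events, n_remove, rng):
    """Baseline: drop n_remove events chosen uniformly at random."""
    drop = set(rng.sample(range(len(events)), min(n_remove, len(events))))
    return [e for i, e in enumerate(events) if i not in drop]


def dependents(graph, seed):
    """Return the seed event plus every event reachable from it."""
    seen, queue = {seed}, deque([seed])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def remove_with_graph(events, graph, seed_event):
    """Assumed rule: drop the seed event and all events that depend on it."""
    gone = dependents(graph, seed_event)
    return [e for e in events if e not in gone]


# Hypothetical dependency graph: a diagnosis leads to labs, prescriptions, referrals.
graph = {
    "diabetes_dx": ["hba1c_lab", "metformin_rx"],
    "hba1c_lab": ["endocrinology_referral"],
}

# Hypothetical record for one patient, as an ordered list of event codes.
record = [
    "diabetes_dx",
    "hba1c_lab",
    "metformin_rx",
    "endocrinology_referral",
    "flu_vaccine",
    "chest_xray",
]

rng = random.Random(0)  # fixed seed so the baseline is reproducible
random_record = remove_random(record, 4, rng)
graph_record = remove_with_graph(record, graph, "diabetes_dx")

print("random removal:", random_record)
print("graph removal: ", graph_record)
```

Both calls remove four of the six events, but the graph-guided version deletes a clinically coherent cluster (the diagnosis and everything downstream of it), which mimics a missed care episode more closely than scattered deletions do.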

Similar articles

Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.

Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217.

Adult patient access to electronic health records.

Cochrane Database Syst Rev. 2021 Feb 26;2(2):CD012707. doi: 10.1002/14651858.CD012707.pub2.

Electronic Health Record-Oriented Knowledge Graph System for Collaborative Clinical Decision Support Using Multicenter Fragmented Medical Data: Design and Application Study.

J Med Internet Res. 2024 Jul 5;26:e54263. doi: 10.2196/54263.

Preprocessing structured clinical data for predictive modeling and decision support. A roadmap to tackle the challenges.

Appl Clin Inform. 2016 Dec 7;7(4):1135-1153. doi: 10.4338/ACI-2016-03-SOA-0035.

Integration of genetic and clinical information to improve imputation of data missing from electronic health records.

J Am Med Inform Assoc. 2019 Oct 1;26(10):1056-1063. doi: 10.1093/jamia/ocz041.

Question Answering for Electronic Health Records: Scoping Review of Datasets and Models.

J Med Internet Res. 2024 Oct 30;26:e53636. doi: 10.2196/53636.

Missing clinical and behavioral health data in a large electronic health record (EHR) system.

J Am Med Inform Assoc. 2016 Nov;23(6):1143-1149. doi: 10.1093/jamia/ocw021. Epub 2016 Apr 14.

Multi-task heterogeneous graph learning on electronic health records.

Neural Netw. 2024 Dec;180:106644. doi: 10.1016/j.neunet.2024.106644. Epub 2024 Aug 22.

EHR-BERT: A BERT-based model for effective anomaly detection in electronic health records.

J Biomed Inform. 2024 Feb;150:104605. doi: 10.1016/j.jbi.2024.104605. Epub 2024 Feb 6.

Cited by

Benchmarking Missing Data Imputation Methods for Time Series Using Real-World Test Cases.

Proc Mach Learn Res. 2025 Jun;287:480-501.

Misleading Results in Posttraumatic Stress Disorder Predictive Models Using Electronic Health Record Data: Algorithm Validation Study.

J Med Internet Res. 2025 Aug 27;27:e63352. doi: 10.2196/63352.

Using Real-World Data on Depression from EHR-based Research Networks: A Scoping Review.

Res Sq. 2025 Aug 5:rs.3.rs-7272352. doi: 10.21203/rs.3.rs-7272352/v1.

Systems Factors Contributing to Racial/Ethnic Disparities in Maternal Health: A Systematic Review.

J Racial Ethn Health Disparities. 2025 Aug 11. doi: 10.1007/s40615-025-02583-7.

Enhancing early gestational diabetes mellitus prediction with imputation-based machine learning framework: A comparative study on real-world clinical records.

Digit Health. 2025 Jul 29;11:20552076251352436. doi: 10.1177/20552076251352436. eCollection 2025 Jan-Dec.

Using electronic health records to understand multimorbidity in older people: a scoping review.

Eur Geriatr Med. 2025 Jul 3. doi: 10.1007/s41999-025-01231-x.

Prediction tool for discharge disposition and 30-day readmission using electronic health records among patients hospitalized for traumatic brain injury.

Front Neurol. 2025 Jun 16;16:1581176. doi: 10.3389/fneur.2025.1581176. eCollection 2025.

A probabilistic approach for building disease phenotypes across electronic health records.

BioData Min. 2025 Jun 11;18(1):39. doi: 10.1186/s13040-025-00454-9.

Data Missingness and Equity Implications in the Nation's Largest Student Fitness Surveillance System: The New York City School Based Physical Fitness Testing Programs, 2006-2020.

J Sch Health. 2025 Jul;95(7):498-509. doi: 10.1111/josh.70021. Epub 2025 May 19.

Mitigating Bias in Machine Learning Models with Ethics-Based Initiatives: The Case of Sepsis.

Am J Bioeth. 2025 May 12:1-14. doi: 10.1080/15265161.2025.2497971.
