Mining for equitable health: Assessing the impact of missing data in electronic health records.

Author information

Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, PA, United States.

Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA, United States.

Publication information

J Biomed Inform. 2023 Mar;139:104269. doi: 10.1016/j.jbi.2022.104269. Epub 2023 Jan 5.

Abstract

Electronic health records (EHR) are collected as a routine part of healthcare delivery and have great potential to improve patient health outcomes. They contain multiple years of health information that can be leveraged for risk prediction, disease detection, and treatment evaluation. However, they do not have a consistent, standardized format across institutions, particularly in the United States, and can present significant analytical challenges: they contain multi-scale data from heterogeneous domains and include both structured and unstructured data. Data for individual patients are collected at irregular time intervals and with varying frequencies. In addition to the analytical challenges, EHR can reflect inequity: patients belonging to different groups will have differing amounts of data in their health records. Many of these issues can contribute to biased data collection. The consequence is that the data for under-served groups may be less informative, partly due to more fragmented care, which can be viewed as a type of missing data problem. For EHR data in this complex form, there is currently no framework for introducing realistic missing values. There has also been little to no work assessing the impact of missing data in EHR. In this work, we first introduce terminology to define three levels of EHR data and then propose a novel framework for simulating realistic missing data scenarios in EHR to adequately assess their impact on predictive modeling. We incorporate a medical knowledge graph to capture dependencies between medical events and thereby create a more realistic missing data framework. In an intensive care unit setting, we found that missing data have a greater negative impact on the performance of disease prediction models in groups that tend to have less access to healthcare or to seek less healthcare. We also found that the impact of missing data on disease prediction models is stronger when using the knowledge graph framework to introduce realistic missing values than when using random event removal.
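The contrast between random event removal and dependency-aware removal can be illustrated with a minimal sketch. This is not the authors' framework, which the abstract does not specify. It assumes a patient record is a flat list of event codes and that the knowledge graph can be reduced to a hypothetical `depends_on` mapping from an event to the events it requires. The event names and the propagation rule (dropping an event also drops everything that depends on it) are illustrative assumptions.

```python
import random


def remove_random(events, rate, rng):
    """Baseline: drop each event independently with probability `rate`."""
    return [e for e in events if rng.random() >= rate]


def remove_with_dependencies(events, rate, depends_on, rng):
    """Drop seed events at random, then drop every event that depends on
    an already-dropped event (assumed propagation rule, not from the paper)."""
    # Deduplicate while preserving order so random draws are reproducible.
    unique = list(dict.fromkeys(events))
    dropped = {e for e in unique if rng.random() < rate}

    # Propagate removals along dependency edges until nothing changes.
    changed = True
    while changed:
        changed = False
        for e in unique:
            if e in dropped:
                continue
            if any(parent in dropped for parent in depends_on.get(e, ())):
                dropped.add(e)
                changed = True

    return [e for e in events if e not in dropped]


if __name__ == "__main__":
    # Hypothetical event codes and dependencies, for illustration only.
    record = ["admission", "lab_order", "lab_result", "prescription"]
    graph = {"lab_result": ["lab_order"], "lab_order": ["admission"]}

    rng = random.Random(0)
    print(remove_random(record, 0.3, rng))
    print(remove_with_dependencies(record, 0.3, graph, rng))
```

Under these assumptions, dependency-aware removal never leaves a downstream event (such as a lab result) in the record without the event it depends on, whereas independent random removal can.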

