Department of Population Health, New York University Grossman School of Medicine, New York, NY, United States.
Department of Epidemiology, Mailman School of Public Health at Columbia University, New York, NY, United States.
JMIR Med Inform. 2024 Oct 1;12:e58085. doi: 10.2196/58085.
Electronic health records (EHRs) are increasingly used for epidemiologic research to advance public health practice. However, key variables in EHRs, including demographic information and disease status, are susceptible to missing data or misclassification, which could bias estimates of disease prevalence or risk factor associations.
In this paper, we applied methods from the literature on missing data and causal inference to assess whether we could mitigate information biases when estimating measures of association between potential risk factors and diabetes among a patient population of New York City young adults.
We estimated the odds ratio (OR) for diabetes by race or ethnicity and asthma status using EHR data from NYU Langone Health. Methods from the missing data and causal inference literature were then applied to assess the ability to control for misclassification of health outcomes in the EHR data. We compared EHR-based associations with associations observed from 2 national health surveys, the Behavioral Risk Factor Surveillance System (BRFSS) and the National Health and Nutrition Examination Survey, representing traditional public health surveillance systems.
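As a rough illustration of the measure of association used here, the sketch below computes a crude (unadjusted) odds ratio with a Wald 95% CI from a 2x2 table. It is not the authors' estimation procedure, which the abstract does not specify; the function name `odds_ratio_with_ci` and the counts are hypothetical and chosen only to show the arithmetic.

```python
import math

def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Crude odds ratio and Wald CI from a 2x2 table.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    All four cells must be positive.
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical counts, NOT taken from the study.
or_hat, ci_low, ci_high = odds_ratio_with_ci(a=30, b=70, c=20, d=80)
print(f"OR {or_hat:.2f}, 95% CI {ci_low:.2f}-{ci_high:.2f}")
```

A published analysis would more likely fit a regression model with covariates (and survey weights for BRFSS or NHANES); the crude calculation only shows what the OR and its interval represent.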
Observed EHR-based associations between race or ethnicity and diabetes were comparable to health survey-based estimates, but the association between asthma and diabetes was significantly overestimated (OR_EHR 3.01, 95% CI 2.86-3.18 vs OR_BRFSS 1.23, 95% CI 1.09-1.40). Missing data and causal inference methods reduced information biases in these estimates, yielding relative differences from traditional estimates below 50% (OR_MissingData 1.79, 95% CI 1.67-1.92 and OR_Causal 1.42, 95% CI 1.34-1.51).
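The "below 50%" comparison can be checked with a few lines of arithmetic on the reported asthma-diabetes ORs. This sketch assumes the relative difference is taken on the OR scale as |OR - OR_BRFSS| / OR_BRFSS; the abstract does not state the exact definition, so that formula and the helper name `relative_difference` are assumptions.

```python
def relative_difference(estimate, reference):
    """Absolute difference from the reference, as a fraction of the reference."""
    return abs(estimate - reference) / reference

# Point estimates as reported in the abstract.
OR_BRFSS = 1.23
reported_ors = {"EHR": 3.01, "MissingData": 1.79, "Causal": 1.42}

rel_diff = {
    name: relative_difference(value, OR_BRFSS)
    for name, value in reported_ors.items()
}

for name, value in rel_diff.items():
    print(f"{name}: {value:.1%} relative difference from BRFSS")
```

Under this assumed definition, the unadjusted EHR estimate differs from the BRFSS estimate by well over 100%, while both adjusted estimates fall under the 50% mark, which is consistent with the statement above.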
Findings suggest that without bias adjustment, EHR analyses may yield biased measures of association, driven in part by subgroup differences in health care use. However, applying missing data or causal inference frameworks can help control for and, importantly, characterize residual information biases in these estimates.