Beesley Lauren J, Mukherjee Bhramar
University of Michigan, Department of Biostatistics.
medRxiv. 2020 Dec 23:2020.12.21.20248644. doi: 10.1101/2020.12.21.20248644.
Electronic health records (EHRs) are not designed for population-based research, but they provide access to longitudinal health information for many individuals. Many statistical methods have been proposed to account for selection bias, missing data, phenotyping errors, or other problems that arise in EHR data analysis. However, addressing multiple sources of bias simultaneously is challenging. Recently, we developed a methodological framework, implemented in an R package, that leverages external data sources to jointly handle selection bias and phenotype misclassification in the EHR setting. These methods assume that the factors related to selection and misclassification are fully observed, but in practice these factors may be poorly understood and only partially observed. As a follow-up to the methodological work, we explore how these methods perform in three real-world case studies. In all three examples, we use individual patient-level data collected through the University of Michigan Health System together with various external population-based data sources. In case study (a), we explore the impact of these methods on estimated associations between gender and cancer diagnosis. In case study (b), we compare corrected associations between previously identified genetic loci and age-related macular degeneration with gold-standard external estimates. In case study (c), we evaluate these methods for modeling the association between COVID-19 outcomes and potential risk factors. These case studies illustrate how diverse auxiliary information can be used to achieve less biased inference in EHR-based research.