Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, Michigan, USA.
Biometrics. 2022 Mar;78(1):214-226. doi: 10.1111/biom.13400. Epub 2020 Dec 3.
Health research using electronic health records (EHR) has gained popularity, but misclassification of EHR-derived disease status and lack of representativeness of the study sample can result in substantial bias in effect estimates and can impact power and type I error. In this paper, we develop new strategies for handling disease status misclassification and selection bias in EHR-based association studies. We first focus on each type of bias separately. For misclassification, we propose three novel likelihood-based bias correction strategies. A distinguishing feature of the EHR setting is that misclassification may be related to patient-varying factors, and the proposed methods leverage data in the EHR to estimate misclassification rates without gold standard labels. For addressing selection bias, we describe how calibration and inverse probability weighting methods from the survey sampling literature can be extended and applied to the EHR setting. Addressing misclassification and selection biases simultaneously is a more challenging problem than dealing with each on its own, and we propose several new strategies. For all methods proposed, we derive valid standard error estimators and provide software for implementation. We provide a new suite of statistical estimation and inference strategies for addressing misclassification and selection bias simultaneously that is tailored to problems arising in EHR data analysis. We apply these methods to data from The Michigan Genomics Initiative, a longitudinal EHR-linked biorepository.
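To make the inverse probability weighting idea mentioned above concrete, the following is a minimal sketch rather than the estimator developed in the paper. It assumes each sampled patient's probability of selection into the EHR cohort is already known, and it estimates a simple disease prevalence with a ratio (Hájek-type) weighted mean instead of the association parameters the paper targets. The function name `ipw_mean` and the toy numbers are hypothetical and chosen only for illustration.

```python
def ipw_mean(outcomes, selection_probs):
    """Weighted mean of `outcomes`, weighting each sampled unit by the
    inverse of its (assumed known) selection probability."""
    if len(outcomes) != len(selection_probs):
        raise ValueError("outcomes and selection_probs must have equal length")
    if any(p <= 0 or p > 1 for p in selection_probs):
        raise ValueError("selection probabilities must lie in (0, 1]")
    weights = [1.0 / p for p in selection_probs]
    weighted_sum = sum(w * y for w, y in zip(weights, outcomes))
    return weighted_sum / sum(weights)


# Toy sample (hypothetical): diseased patients are three times as likely
# to appear in the EHR cohort as non-diseased patients.
outcomes = [1, 1, 1, 0, 0]
selection_probs = [0.6, 0.6, 0.6, 0.2, 0.2]

naive_prevalence = sum(outcomes) / len(outcomes)
weighted_prevalence = ipw_mean(outcomes, selection_probs)

print(f"naive: {naive_prevalence:.3f}, weighted: {weighted_prevalence:.3f}")
```

In this toy sample the unweighted prevalence overstates disease frequency because cases are over-represented, and the weights shift the estimate back toward the non-cases. In practice the selection probabilities are unknown and must themselves be estimated or calibrated against external population summaries, which is the harder problem the abstract refers to.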