Marks-Anglin Arielle, Chen Jianmin, Luo Chongliang, Hubbard Rebecca, Chen Yong
Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, USA.
Division of Public Health Sciences, Washington University School of Medicine, St Louis, MO, USA.
Stat Med. 2025 May;44(10-12):e70095. doi: 10.1002/sim.70095.
Electronic health record (EHR) databases are an increasingly valuable resource for observational studies. However, misclassification of EHR-derived outcomes due to imperfect phenotyping leads to bias, inflated type I error, and reduced power in risk-factor association studies. Manual chart review to validate outcomes, on the other hand, is both cost-prohibitive and time-consuming, and a randomly selected validation sample may not yield enough cases to support precise model estimation when the disease is rare. Sampling procedures have been developed to maximize computational and statistical efficiency in settings where the true disease status is known. However, less work has been done in measurement-constrained settings, particularly when an informative surrogate outcome is available. Motivated by this gap, we propose an Optimal Subsampling strategy with Surrogate-Assisted Two-step procedure (OSSAT) to guide cost-effective chart review in measurement-constrained settings. The sampling weight in OSSAT leverages information contained in the potentially misclassified phenotype and covariates to prioritize the observations most informative for the model of interest. We compare our proposed weight with existing approaches through simulations under various covariate distributions, differential misclassification rates, and degrees of surrogate accuracy. We then apply our proposed weighting schemes to a study of risk factors for second breast cancer events using a real EHR data set.
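The abstract does not state the OSSAT weight itself, so the following is only a minimal sketch of the general idea of a surrogate-assisted two-step subsample, not the authors' procedure. It assumes a single covariate, a logistic outcome model, a uniform pilot sample, and a hypothetical second-step score |s - p_hat| * sqrt(1 + x^2), in which the surrogate s stands in for the unobserved true outcome. All function names, the simulated cohort, and the sensitivity and specificity values are illustrative assumptions.

```python
import math
import random


def sigmoid(z):
    # Clamp the linear predictor to avoid overflow in exp().
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def fit_weighted_logistic(xs, ys, ws, iters=25):
    """Weighted logistic regression y ~ b0 + b1*x via Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y, w in zip(xs, ys, ws):
            p = sigmoid(b0 + b1 * x)
            r = w * (y - p)          # weighted score contribution
            v = w * p * (1.0 - p)    # weighted information contribution
            g0 += r
            g1 += r * x
            h00 += v
            h01 += v * x
            h11 += v * x * x
        det = h00 * h11 - h01 * h01
        if det <= 1e-12:
            break
        # Newton step: solve the 2x2 system H * delta = g in closed form.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def simulate_cohort(n, rng):
    """Hypothetical cohort of (x, true status y, error-prone surrogate s)."""
    cohort = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = 1 if rng.random() < sigmoid(-2.5 + 1.0 * x) else 0
        # Assumed surrogate accuracy: sensitivity 0.85, specificity 0.95.
        if y == 1:
            s = 1 if rng.random() < 0.85 else 0
        else:
            s = 0 if rng.random() < 0.95 else 1
        cohort.append((x, y, s))
    return cohort


def two_step_subsample(cohort, n_pilot, n_second, rng):
    idx = list(range(len(cohort)))
    # Step 1: uniform pilot sample; "chart review" reveals the true y.
    pilot = rng.sample(idx, n_pilot)
    b0, b1 = fit_weighted_logistic(
        [cohort[i][0] for i in pilot],
        [cohort[i][1] for i in pilot],
        [1.0] * n_pilot,
    )
    # Step 2: score the remaining records using only x and the surrogate s.
    pilot_set = set(pilot)
    rest = [i for i in idx if i not in pilot_set]
    scores = []
    for i in rest:
        x, _, s = cohort[i]
        p_hat = sigmoid(b0 + b1 * x)
        # Illustrative score (an assumption, not the OSSAT weight).
        scores.append(abs(s - p_hat) * math.sqrt(1.0 + x * x))
    total = sum(scores)
    probs = [sc / total for sc in scores]
    # Draw with replacement; position j in `rest` has probability probs[j].
    draws = rng.choices(range(len(rest)), weights=probs, k=n_second)
    xs = [cohort[rest[j]][0] for j in draws]
    ys = [cohort[rest[j]][1] for j in draws]  # revealed by chart review
    ws = [1.0 / probs[j] for j in draws]      # inverse-probability weights
    beta = fit_weighted_logistic(xs, ys, ws)
    return {"pilot_beta": (b0, b1), "probs": probs, "draws": draws, "beta": beta}
```

A call such as `two_step_subsample(simulate_cohort(3000, rng), 300, 300, rng)` with `rng = random.Random(0)` returns the pilot estimate, the second-step sampling probabilities, and the inverse-probability-weighted estimate. Records whose surrogate disagrees with the pilot prediction receive larger sampling probabilities, which is one simple way to direct chart review toward likely cases when the disease is rare.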