Department of Biostatistics, Epidemiology and Informatics, The University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania, USA.
Kaiser Permanente Washington Health Research Institute, Seattle, Washington, USA.
J Am Med Inform Assoc. 2022 Apr 13;29(5):918-927. doi: 10.1093/jamia/ocab267.
Electronic health records (EHRs) enable investigation of the association between phenotypes and risk factors. However, studies relying solely on potentially error-prone EHR-derived phenotypes (ie, surrogates) are subject to bias. Analyses of low-prevalence phenotypes may also suffer from poor efficiency. Existing methods typically focus on one of these issues but seldom address both. This study aims to address both issues simultaneously by developing new sampling methods that select an optimal subsample in which to collect gold standard phenotypes, thereby improving the accuracy of association estimation.
We develop a surrogate-assisted two-wave (SAT) sampling method in which a surrogate-guided sampling (SGS) procedure and a modified optimal subsampling procedure motivated by the A-optimality criterion (OSMAC) are applied sequentially to select, subject to budget constraints, a subsample for outcome validation through manual chart review. A model is then fitted to the subsample using the true phenotypes. Simulation studies and an application to an EHR dataset of breast cancer survivors are conducted to demonstrate the effectiveness of SAT.
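The two-wave idea described above can be illustrated with a minimal sketch. This is not the authors' implementation, and several details are assumptions made only for illustration: a logistic outcome model; a first wave that draws equal numbers of surrogate-positive and surrogate-negative patients; a second wave whose sampling probabilities follow the usual OSMAC form, with the unknown |y - p| replaced by its expectation 2p(1 - p) under the pilot model; and a pooled inverse-probability-weighted final fit. The names `fit_logistic`, `sat_sketch`, and `review` (a stand-in for manual chart review) are hypothetical.

```python
import numpy as np

def fit_logistic(X, y, w=None, n_iter=50, ridge=1e-8):
    """Weighted logistic regression fitted by Newton-Raphson."""
    w = np.ones(len(y)) if w is None else np.asarray(w, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1 - p))[:, None]).T @ X + ridge * np.eye(X.shape[1])
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < 1e-8:
            break
    return beta

def sat_sketch(X, s, review, n1, n2, rng):
    """Two-wave sampling sketch. Assumes both surrogate strata are non-empty.

    X: (N, d) design matrix; s: (N,) binary surrogate phenotype;
    review(idx): returns gold standard labels for rows idx (hypothetical).
    """
    N = len(s)
    # Wave 1 (assumed SGS-style design): balance surrogate positives and negatives.
    pos, neg = np.flatnonzero(s == 1), np.flatnonzero(s == 0)
    k_pos = min(n1 // 2, len(pos))
    k_neg = min(n1 - k_pos, len(neg))
    idx1 = np.concatenate([rng.choice(pos, k_pos, replace=False),
                           rng.choice(neg, k_neg, replace=False)])
    pi1 = np.where(s == 1, k_pos / len(pos), k_neg / len(neg))  # inclusion probabilities
    y1 = review(idx1)
    beta0 = fit_logistic(X[idx1], y1, w=1.0 / pi1[idx1])  # pilot estimate

    # Wave 2 (OSMAC-style): probabilities proportional to E|y - p| * ||M^{-1} x||.
    p = 1.0 / (1.0 + np.exp(-X @ beta0))
    M = (X * (p * (1 - p))[:, None]).T @ X / N + 1e-8 * np.eye(X.shape[1])
    lev = np.linalg.norm(np.linalg.solve(M, X.T).T, axis=1)
    score = 2 * p * (1 - p) * lev  # assumption: true y is unknown before review
    rest = np.setdiff1d(np.arange(N), idx1)
    q = score[rest] / score[rest].sum()
    draw = rng.choice(len(rest), size=n2, replace=True, p=q)
    idx2 = rest[draw]
    y2 = review(idx2)

    # Final fit: pool both waves with inverse-probability weights.
    w = np.concatenate([1.0 / pi1[idx1], 1.0 / (n2 * q[draw])])
    idx = np.concatenate([idx1, idx2])
    beta = fit_logistic(X[idx], np.concatenate([y1, y2]), w=w)
    return idx1, idx2, beta
```

In this sketch `n1 + n2` plays the role of the chart-review budget. Sampling with replacement in the second wave keeps the weights simple, at the cost of possibly reviewing fewer than `n2` distinct charts.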
We found that the subsample selected with the proposed method contains informative observations that effectively reduce the mean squared error of the resultant estimator of the association.
The proposed approach can address the problems posed by the rarity of cases and the misclassification of the surrogate in phenotype-absent EHR-based association studies. With a well-behaved surrogate, SAT successfully boosts the case prevalence in the subsample and improves the efficiency of estimation.