SAT: a Surrogate-Assisted Two-wave case boosting sampling method, with application to EHR-based association studies.

Author information

Department of Biostatistics, Epidemiology and Informatics, The University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania, USA.

Kaiser Permanente Washington Health Research Institute, Seattle, Washington, USA.

Publication information

J Am Med Inform Assoc. 2022 Apr 13;29(5):918-927. doi: 10.1093/jamia/ocab267.

Abstract

OBJECTIVES

Electronic health records (EHRs) enable investigation of the association between phenotypes and risk factors. However, studies relying solely on potentially error-prone EHR-derived phenotypes (i.e., surrogates) are subject to bias. Analyses of low-prevalence phenotypes may also suffer from poor efficiency. Existing methods typically focus on one of these issues but seldom address both. This study aims to address both issues simultaneously by developing new sampling methods that select an optimal subsample in which to collect gold standard phenotypes, improving the accuracy of association estimation.

MATERIALS AND METHODS

We develop a surrogate-assisted two-wave (SAT) sampling method in which a surrogate-guided sampling (SGS) procedure and a modified optimal subsampling procedure motivated by the A-optimality criterion (OSMAC) are applied sequentially to select a subsample for outcome validation through manual chart review, subject to budget constraints. A model is then fitted to the subsample using the true phenotypes. Simulation studies and an application to an EHR dataset of breast cancer survivors are conducted to demonstrate the effectiveness of SAT.
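The two-wave idea described above can be illustrated with a minimal sketch on simulated data. Everything specific here is an assumption made for illustration, not the paper's exact procedure: the helper names (`sgs_probs`, `osmac_like_probs`, `fit_weighted_logistic`), the choice to give surrogate-positive records half of the wave-1 sampling mass, the substitution of the surrogate for the unobserved phenotype in the wave-2 probabilities, and the inverse-probability-weighted logistic fit.

```python
import numpy as np

def fit_weighted_logistic(X, y, w, n_iter=50, tol=1e-8):
    """Newton-Raphson for a weighted logistic regression; returns coefficients."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def sgs_probs(s, case_share=0.5):
    """Wave 1 (assumed form): surrogate positives share `case_share` of the sampling mass."""
    pos = s == 1
    return np.where(pos, case_share / pos.sum(), (1.0 - case_share) / (~pos).sum())

def osmac_like_probs(X, s, beta_pilot):
    """Wave 2 (assumed form): OSMAC-style scores with the surrogate standing in for y."""
    p = 1.0 / (1.0 + np.exp(-X @ beta_pilot))
    score = np.abs(s - p) * np.linalg.norm(X, axis=1)
    return score / score.sum()

# Simulated cohort: rare true phenotype y, error-prone surrogate s (all values hypothetical).
rng = np.random.default_rng(0)
N = 20000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([-3.5, 1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))
s = np.where(y == 1, rng.binomial(1, 0.85, N), rng.binomial(1, 0.05, N))

n1, n2 = 300, 300  # chart-review budget split across the two waves

# Wave 1: surrogate-guided sample, then "chart review" reveals y for the sampled records.
probs1 = sgs_probs(s)
idx1 = rng.choice(N, size=n1, replace=True, p=probs1)
beta_pilot = fit_weighted_logistic(X[idx1], y[idx1], 1.0 / probs1[idx1])

# Wave 2: sampling probabilities built from the pilot fit and the surrogate.
probs2 = osmac_like_probs(X, s, beta_pilot)
idx2 = rng.choice(N, size=n2, replace=True, p=probs2)

# Final fit: pool both waves, weighting each record by its inverse sampling probability.
idx = np.concatenate([idx1, idx2])
w = np.concatenate([1.0 / probs1[idx1], 1.0 / probs2[idx2]])
beta_hat = fit_weighted_logistic(X[idx], y[idx], w)

print("full-cohort case prevalence:", y.mean())
print("wave-1 case prevalence:     ", y[idx1].mean())
print("estimated coefficients:     ", beta_hat)
```

Sampling is done with replacement so that the inverse-probability weights keep the pooled estimating equation unbiased for the full-cohort one; only the `n1 + n2` sampled records ever need their true phenotype.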

RESULTS

We found that the subsample selected with the proposed method contains informative observations that effectively reduce the mean squared error of the resultant estimator of the association.
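For reference, the mean squared error of an estimator $\hat{\beta}$ of the association parameter $\beta$ splits into a variance part and a bias part. This is a standard identity; the notation is ours and is not taken from the paper:

```latex
\mathrm{MSE}(\hat{\beta})
  = \mathbb{E}\,\lVert \hat{\beta} - \beta \rVert^{2}
  = \operatorname{tr}\{\mathrm{Var}(\hat{\beta})\}
  + \lVert \mathbb{E}(\hat{\beta}) - \beta \rVert^{2}
```

An A-optimality criterion targets the first term, the trace of the estimator's (asymptotic) covariance matrix.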

CONCLUSIONS

The proposed approach can handle the problems caused by the rarity of cases and by misclassification of the surrogate in phenotype-absent EHR-based association studies. With a well-behaved surrogate, SAT successfully boosts the case prevalence in the subsample and improves the efficiency of estimation.
