Accounting for Dependent Errors in Predictors and Time-to-Event Outcomes Using Electronic Health Records, Validation Samples, and Multiple Imputation

Authors

Giganti Mark J, Shaw Pamela A, Chen Guanhua, Bebawy Sally S, Turner Megan M, Sterling Timothy R, Shepherd Bryan E

Affiliations

Department of Biostatistics, Vanderbilt University.

Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania.

Publication

Ann Appl Stat. 2020 Jun;14(2):1045-1061. doi: 10.1214/20-aoas1343. Epub 2020 Jun 29.

Abstract

Data from electronic health records (EHR) are prone to errors, which are often correlated across multiple variables. The error structure is further complicated when analysis variables are derived as functions of two or more error-prone variables. Such errors can substantially impact estimates, yet we are unaware of methods that simultaneously account for errors in covariates and time-to-event outcomes. Using EHR data from 4217 patients, the hazard ratio for an AIDS-defining event associated with a 100 cells/mm³ increase in CD4 count at ART initiation was 0.74 (95% CI: 0.68-0.80) using unvalidated data and 0.60 (95% CI: 0.53-0.68) using fully validated data. Our goal is to obtain unbiased and efficient estimates after validating a random subset of records. We propose fitting discrete failure time models to the validated subsample and then multiply imputing values for unvalidated records. We demonstrate how this approach simultaneously addresses dependent errors in predictors, time-to-event outcomes, and inclusion criteria. Using the fully validated dataset as a gold standard, we compare the mean squared error of our estimates with those from the unvalidated dataset and the corresponding subsample-only dataset for various subsample sizes. With reasonably sized validated subsamples and appropriate imputation models, our approach improved estimation relative to both the naive analysis and the analysis using only the validation subsample.
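The abstract does not say how estimates from the imputed datasets are combined. As a minimal sketch, assuming the standard Rubin's rules are used for pooling (an assumption, not something stated in the abstract), the final step of a multiple-imputation analysis could look like the following. The function name `pool_rubin` and all input numbers are hypothetical and are not taken from the paper.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m point estimates and their variances with Rubin's rules.

    estimates: one point estimate per imputed dataset (e.g. a log hazard ratio)
    variances: the squared standard error of each estimate
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = mean(estimates)      # pooled point estimate
    w = mean(variances)          # within-imputation variance
    b = variance(estimates)      # between-imputation variance (m - 1 denominator)
    t = w + (1 + 1 / m) * b      # total variance
    return q_bar, t


# Made-up log hazard ratios from five imputed datasets (illustrative only).
log_hrs = [-0.50, -0.52, -0.48, -0.51, -0.49]
std_errors = [0.06, 0.06, 0.06, 0.06, 0.06]

pooled, total_var = pool_rubin(log_hrs, [se ** 2 for se in std_errors])
half_width = 1.96 * math.sqrt(total_var)  # normal approximation for brevity

print("pooled HR:", math.exp(pooled))
print("approx. 95% CI:", math.exp(pooled - half_width), math.exp(pooled + half_width))
```

The interval here uses a normal approximation to keep the sketch short; a full analysis would typically use a t reference distribution with Rubin's degrees of freedom.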
