Department of Epidemiology, University of Michigan, Ann Arbor, MI 48109-2029, United States.
Center for Precision Health Data Science, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109-2029, United States.
J Am Med Inform Assoc. 2024 Jun 20;31(7):1479-1492. doi: 10.1093/jamia/ocae098.
To develop recommendations regarding the use of weights to reduce selection bias for commonly performed analyses using electronic health record (EHR)-linked biobank data.
We mapped diagnosis (ICD code) data from 3 EHR-linked biobanks with varying recruitment strategies to standardized phecodes: All of Us (AOU; n = 244 071), Michigan Genomics Initiative (MGI; n = 81 243), and UK Biobank (UKB; n = 401 167). Using 2019 National Health Interview Survey data, we constructed selection weights for AOU and MGI so that they better represent the US adult population. For UKB, we used previously developed weights that represent the UKB-eligible population. We conducted 4 common analyses, comparing unweighted and weighted results.
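To illustrate the general idea of a selection weight, the following is a minimal sketch of simple poststratification, in which each participant is weighted by the ratio of their stratum's share in the target population to its share in the sample. This is an assumption for illustration only: the abstract does not state which weighting method was used, and the function name, strata, and target shares below are hypothetical toy values, not figures from the National Health Interview Survey.

```python
from collections import Counter


def poststratification_weights(sample_strata, target_shares):
    """Return one weight per participant: target share / sample share of their stratum.

    sample_strata: list of stratum labels, one per participant (hypothetical).
    target_shares: dict mapping stratum label -> share in the target population.
    """
    n = len(sample_strata)
    counts = Counter(sample_strata)
    weights = []
    for stratum in sample_strata:
        sample_share = counts[stratum] / n
        weights.append(target_shares[stratum] / sample_share)
    return weights


# Toy example: older adults are over-represented in the sample,
# so they receive weights below 1 and younger adults weights above 1.
sample_strata = ["18-44", "18-44", "45-64", "65+", "65+", "65+", "65+", "65+"]
target_shares = {"18-44": 0.5, "45-64": 0.3, "65+": 0.2}  # assumed, not real survey data
weights = poststratification_weights(sample_strata, target_shares)
```

With weights defined this way, they sum to the sample size, which makes weighted and unweighted summaries directly comparable.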
For AOU and MGI, estimated phecode prevalences decreased after weighting (median weighted-to-unweighted phecode prevalence ratio [MPR]: 0.82 and 0.61, respectively), while UKB estimates increased (MPR: 1.06). Weighting minimally impacted latent phenome dimensionality estimation. In a comparison of weighted and unweighted phenome-wide association studies for colorectal cancer, the strongest associations remained unaltered, with considerable overlap in significant hits. Weighting shifted the estimated log-odds ratio for the association between sex and colorectal cancer toward national registry-based estimates.
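The median prevalence ratio reported above can be sketched as follows: compute each phecode's weighted and unweighted prevalence, take their ratio, and report the median across phecodes. This is a minimal sketch assuming binary case indicators per participant; the function names, phecode labels, and data are hypothetical, and the handling of phecodes with zero unweighted prevalence (skipped here) is an assumption rather than the authors' stated rule.

```python
from statistics import median


def prevalence(indicator, weights=None):
    """Weighted share of participants with the phecode (unweighted if weights is None)."""
    if weights is None:
        weights = [1.0] * len(indicator)
    return sum(w * x for w, x in zip(weights, indicator)) / sum(weights)


def median_prevalence_ratio(phecode_indicators, weights):
    """Median over phecodes of (weighted prevalence / unweighted prevalence)."""
    ratios = []
    for indicator in phecode_indicators.values():
        unweighted = prevalence(indicator)
        if unweighted > 0:  # assumption: skip phecodes with no cases
            ratios.append(prevalence(indicator, weights) / unweighted)
    return median(ratios)


# Toy data: 5 participants, 3 hypothetical phecodes (1 = has the diagnosis).
selection_weights = [2.0, 2.0, 1.0, 0.5, 0.5]
phecode_indicators = {
    "phecode_A": [0, 0, 1, 1, 1],
    "phecode_B": [1, 0, 0, 0, 1],
    "phecode_C": [0, 0, 0, 1, 1],
}
mpr = median_prevalence_ratio(phecode_indicators, selection_weights)
```

An MPR below 1 indicates that weighting lowers prevalence estimates for the typical phecode, as reported for AOU and MGI; an MPR above 1 indicates the opposite, as reported for UKB.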
Weighting had a limited impact on dimensionality estimation and large-scale hypothesis testing but affected prevalence and association estimation. When effect sizes are of interest, specific signals from untargeted association analyses should be followed up with weighted analyses.
EHR-linked biobanks should report recruitment and selection mechanisms and provide selection weights with defined target populations. Researchers should consider their intended estimands, specify source and target populations, and weight EHR-linked biobank analyses accordingly.