Reducing Information and Selection Bias in EHR-Linked Biobanks via Genetics-Informed Multiple Imputation and Sample Weighting

Author information

Salvatore Maxwell, Kundu Ritoban, Du Jiacong, Friese Christopher R, Mondul Alison M, Hanauer David, Lu Haidong, Pearce Celeste Leigh, Mukherjee Bhramar

Affiliations

Department of Epidemiology, University of Michigan, Ann Arbor, MI, USA.

Center for Precision Health Data Science, University of Michigan, Ann Arbor, MI, USA.

Publication information

medRxiv. 2024 Oct 29:2024.10.28.24316286. doi: 10.1101/2024.10.28.24316286.

DOI:10.1101/2024.10.28.24316286

PMID:39574876

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11581092/

Abstract

Electronic health records (EHRs) are valuable for public health and clinical research but are prone to many sources of bias, including missing data and non-probability selection. Missing data in EHRs is complex due to potential non-recording, fragmentation, or clinically informative absences. This study explores whether polygenic risk score (PRS)-informed multiple imputation for missing traits, combined with sample weighting, can mitigate missing data and selection biases in estimating disease-exposure associations. Simulations were conducted for missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) conditions under different sampling mechanisms. PRS-informed multiple imputation showed generally lower bias, particularly when combined with sample weighting. For example, in biased samples of 10,000 with exposure and outcome MAR data, PRS-informed imputation had lower percent bias (3.8%) and better coverage rate (0.883) compared to PRS-uninformed (4.5%; 0.877) and complete case analyses (10.3%; 0.784) in covariate-adjusted, weighted, multiple imputation scenarios. In a case study using Michigan Genomics Initiative (n=50,026) data, PRS-informed imputation aligned more closely with a sample-weighted All of Us-derived benchmark than analyses ignoring missing data and selection bias. Researchers should consider leveraging genetic data and sample weighting to address biases from missing data and non-probability sampling in biobanks.
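The workflow described in the abstract (impute a missing exposure with and without a PRS as a predictor, then fit a sample-weighted analysis model and pool across imputations) can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' implementation: all variable names, effect sizes, and selection and missingness parameters are hypothetical, the outcome is continuous and modeled linearly for simplicity (the study concerns disease-exposure associations), and the selection probabilities are treated as known, whereas in practice the weights would have to be estimated.

```python
import numpy as np

rng = np.random.default_rng(0)

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def wls(X, y, w):
    """Weighted least squares with a sandwich (robust) covariance matrix."""
    Xw = X * w[:, None]
    bread = np.linalg.inv(Xw.T @ X)
    beta = bread @ (Xw.T @ y)
    resid = y - X @ beta
    meat = (Xw * resid[:, None]).T @ (Xw * resid[:, None])
    return beta, bread @ meat @ bread

def rubin_pool(estimates, variances):
    """Rubin's rules: pooled estimate and total variance over m imputations."""
    q, u = np.asarray(estimates), np.asarray(variances)
    m = len(q)
    return q.mean(), u.mean() + (1 + 1 / m) * q.var(ddof=1)

def impute_once(x_obs, predictors, rng):
    """One stochastic regression imputation; a bootstrap of the observed
    rows approximates parameter uncertainty in the imputation model."""
    obs = ~np.isnan(x_obs)
    idx = rng.choice(np.flatnonzero(obs), size=obs.sum(), replace=True)
    Z = np.column_stack([np.ones(len(x_obs))] + predictors)
    coef, *_ = np.linalg.lstsq(Z[idx], x_obs[idx], rcond=None)
    sigma = np.std(x_obs[idx] - Z[idx] @ coef, ddof=Z.shape[1])
    x_imp = x_obs.copy()
    x_imp[~obs] = Z[~obs] @ coef + rng.normal(scale=sigma, size=(~obs).sum())
    return x_imp

# Hypothetical source population (all parameter values are assumptions).
N = 200_000
TRUE_BETA = 0.5
prs = rng.normal(size=N)                  # polygenic risk score for the exposure
x = 0.6 * prs + rng.normal(size=N)        # exposure, partly genetically determined
y = TRUE_BETA * x + rng.normal(size=N)    # continuous outcome (simplification)

# Non-probability selection that depends on the outcome; weights are the
# inverse of the (here known) selection probabilities.
p_sel = expit(-3.0 + 0.8 * y)
sel = rng.random(N) < p_sel
prs, x, y, w = prs[sel], x[sel], y[sel], 1.0 / p_sel[sel]

# Exposure missing at random (MAR): missingness depends on the observed outcome.
miss = rng.random(len(x)) < expit(-1.0 + 0.5 * y)
x_obs = np.where(miss, np.nan, x)

def mi_estimate(predictors, m=20):
    """Impute m times, fit the weighted analysis model, pool with Rubin's rules."""
    ests, variances = [], []
    for _ in range(m):
        x_imp = impute_once(x_obs, predictors, rng)
        beta, cov = wls(np.column_stack([np.ones(len(y)), x_imp]), y, w)
        ests.append(beta[1])
        variances.append(cov[1, 1])
    return rubin_pool(ests, variances)

est_prs, var_prs = mi_estimate([y, prs])   # PRS-informed imputation model
est_noprs, var_noprs = mi_estimate([y])    # PRS-uninformed imputation model

# Weighted complete-case analysis for comparison.
cc = ~miss
beta_cc, _ = wls(np.column_stack([np.ones(cc.sum()), x[cc]]), y[cc], w[cc])

def percent_bias(estimate, truth):
    return 100.0 * (estimate - truth) / truth

for label, est in [("PRS-informed MI", est_prs),
                   ("PRS-uninformed MI", est_noprs),
                   ("complete case", beta_cc[1])]:
    print(f"{label}: estimate={est:.3f}, percent bias={percent_bias(est, TRUE_BETA):.1f}%")
```

Repeating this over many simulated samples and recording how often the 95% interval from each method contains the true coefficient would give a coverage rate analogous to the one reported in the abstract; a single run like this one only shows the mechanics and should not be read as reproducing the reported figures.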
