Pimplaskar Aditya, Qiu Junqiong, Lapinska Sandra, Tozzo Veronica, Chiang Jeffrey N, Pasaniuc Bogdan, Olde Loohuis Loes M
Bioinformatics Interdepartmental Program, UCLA, Los Angeles, CA, USA.
Center for Neurobehavioral Genetics, Semel Institute for Neuroscience and Human Behavior, Department of Psychiatry and Biobehavioral Sciences, David Geffen School of Medicine, UCLA, Los Angeles, CA, USA.
medRxiv. 2025 Apr 6:2025.04.04.25325131. doi: 10.1101/2025.04.04.25325131.
Electronic health record (EHR)-linked biobanks have emerged as promising tools for precision medicine, enabling the integration of clinical and molecular data for individual risk assessment. Association studies performed in biobanks can connect common genetic variation to clinical phenotypes, for example through polygenic scores (PGS), which are starting to have utility in aiding clinician decision-making. However, while biobanks aggregate large amounts of data effectively for such studies, most employ opt-in consent protocols and, as a result, are expected to be subject to participation and recruitment biases. The extent to which these biases affect genetic analyses in biobanks remains unstudied. In this study, we quantify bias and evaluate its impact on genetic analyses, using the UCLA ATLAS Community Health Initiative as a case study. Our analyses reveal that a wide array of factors, particularly socio-demographic characteristics and healthcare utilization patterns, influence participation and effectively differentiate biobank participants from the broader patient population (AUROC = 0.85, AUPRC = 0.82). By weighting the sample using inverse probability weights derived from probabilities of enrollment, we replicated 54% more known GWAS variants (e.g., associations between variants in the gene and type 2 diabetes) than with models that did not take bias into account. We further show that PGS phenome-wide associations are affected by the weighting scheme, and suggest that associations corroborated by weighted analyses are more robust. Our results highlight that genetic analyses within biobanks should account for inclusion biases, and suggest inverse probability weighting as a potential approach.
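The inverse probability weighting idea described above can be illustrated with a minimal sketch on synthetic data. This is not the authors' pipeline: the abstract does not specify the enrollment model or its covariates, so the logistic model, the covariate names (`age`, `visits`), the simulated `trait`, and the helper `fit_logistic` are all hypothetical choices made for illustration. The sketch fits an enrollment model on the full patient population, weights each enrolled participant by the inverse of their estimated enrollment probability, and compares the naive and weighted participant means against the population mean.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, n_iter=25):
    """Fit a logistic regression by Newton's method (hypothetical helper).

    X must already contain an intercept column; y holds 0/1 labels.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        w = p * (1.0 - p)
        # Hessian of the negative log-likelihood, with a tiny ridge for stability.
        hessian = X.T @ (X * w[:, None]) + 1e-8 * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hessian, X.T @ (y - p))
    return beta

# Synthetic "patient population"; every variable here is an assumption.
rng = np.random.default_rng(0)
n = 20000
age = rng.normal(size=n)      # standardized age
visits = rng.normal(size=n)   # standardized healthcare utilization
X = np.column_stack([np.ones(n), age, visits])

# Enrollment depends on the covariates, so participants are a biased sample.
true_p = sigmoid(-0.5 + 1.0 * age + 0.8 * visits)
enrolled = rng.random(n) < true_p

# A trait correlated with age, so selection on age biases its sample mean.
trait = 2.0 + 1.5 * age + rng.normal(size=n)

# Step 1: estimate each patient's probability of enrollment.
beta_hat = fit_logistic(X, enrolled.astype(float))
p_hat = sigmoid(X @ beta_hat)

# Step 2: weight each participant by the inverse of that probability.
weights = 1.0 / p_hat[enrolled]

# Step 3: compare unweighted and weighted estimates among participants.
pop_mean = trait.mean()
naive_mean = trait[enrolled].mean()
ipw_mean = np.average(trait[enrolled], weights=weights)

print(f"population mean: {pop_mean:.3f}")
print(f"naive participant mean: {naive_mean:.3f}")
print(f"IPW participant mean: {ipw_mean:.3f}")
```

In this toy setting the weighted mean should fall much closer to the population mean than the naive participant mean does. In an association analysis, the same weights would typically be passed to a weighted regression rather than a weighted mean, and very small estimated probabilities may call for weight truncation or stabilization.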