Hrytsenko Yana, Shea Benjamin, Elgart Michael, Kurniansyah Nuzulul, Lyons Genevieve, Morrison Alanna C, Carson April P, Haring Bernhard, Mitchel Braxton D, Psaty Bruce M, Jaeger Byron C, Gu C Charles, Kooperberg Charles, Levy Daniel, Lloyd-Jones Donald, Choi Eunhee, Brody Jennifer A, Smith Jennifer A, Rotter Jerome I, Moll Matthew, Fornage Myriam, Simon Noah, Castaldi Peter, Casanova Ramon, Chung Ren-Hua, Kaplan Robert, Loos Ruth J F, Kardia Sharon L R, Rich Stephen S, Redline Susan, Kelly Tanika, O'Connor Timothy, Zhao Wei, Kim Wonji, Guo Xiuqing, Der Ida Chen Yii, Sofer Tamar
Department of Medicine, Brigham and Women's Hospital, Boston, MA.
Department of Medicine, Harvard Medical School, Boston, MA.
medRxiv. 2023 Dec 14:2023.12.13.23299909. doi: 10.1101/2023.12.13.23299909.
We construct non-linear machine learning (ML) prediction models for systolic and diastolic blood pressure (SBP, DBP) using demographic and clinical variables and polygenic risk scores (PRSs). We develop a two-model ensemble consisting of a baseline model, in which prediction is based on demographic and clinical variables only, and a genetic model, in which we also include PRSs. We compare linear and non-linear models at both the baseline and the genetic model levels and assess the improvement in performance when incorporating multiple PRSs. We report the ensemble model's performance as percentage variance explained (PVE) on a held-out test dataset. Compared with a linear baseline model, a non-linear baseline model improved the PVE from 28.1% to 30.1% (SBP) and from 14.3% to 17.4% (DBP). Compared with using a single PRS, including in the genetic model seven PRSs computed from the largest available genome-wide association study (GWAS) of SBP/DBP improved the genetic model PVE from 4.8% to 5.1% (SBP) and from 4.7% to 5% (DBP). Adding 14 further PRSs computed from two independent GWASs increased the genetic model PVE to 6.3% (SBP) and 5.7% (DBP). PVE differed across self-reported race/ethnicity groups, with primarily all non-White groups benefitting from the inclusion of additional PRSs.
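The two-model ensemble and the PVE metric can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the genetic model is fitted to the residuals of the baseline model, that PVE is computed as 100 × (1 − SSE/SST) on held-out data, and that a single covariate (age) and a single PRS stand in for the full variable sets. The abstract states none of these details. Both stages are linear here for brevity; the data and effect sizes are synthetic, and the helper names `pve`, `fit_line`, and `simulate` are hypothetical.

```python
import random


def pve(y_true, y_pred):
    """Percentage variance explained: 100 * (1 - SSE / SST)."""
    mean_y = sum(y_true) / len(y_true)
    sse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    sst = sum((y - mean_y) ** 2 for y in y_true)
    return 100.0 * (1.0 - sse / sst)


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def simulate(n, rng):
    """Synthetic (age, prs, sbp) rows; all effect sizes are made up."""
    rows = []
    for _ in range(n):
        age = rng.uniform(30, 80)
        prs = rng.gauss(0, 1)
        sbp = 100 + 0.5 * age + 3.0 * prs + rng.gauss(0, 10)
        rows.append((age, prs, sbp))
    return rows


rng = random.Random(0)
train, test = simulate(2000, rng), simulate(1000, rng)
age_tr, prs_tr, sbp_tr = map(list, zip(*train))
age_te, prs_te, sbp_te = map(list, zip(*test))

# Stage 1: baseline model using the clinical/demographic covariate only.
a0, b0 = fit_line(age_tr, sbp_tr)
base_tr = [a0 + b0 * a for a in age_tr]
base_te = [a0 + b0 * a for a in age_te]

# Stage 2 (assumption): genetic model fitted to the baseline residuals.
resid_tr = [y - p for y, p in zip(sbp_tr, base_tr)]
a1, b1 = fit_line(prs_tr, resid_tr)

# Ensemble prediction = baseline prediction + predicted residual.
ens_te = [p + a1 + b1 * g for p, g in zip(base_te, prs_te)]

pve_base = pve(sbp_te, base_te)
pve_ens = pve(sbp_te, ens_te)
print(f"baseline PVE: {pve_base:.1f}%  ensemble PVE: {pve_ens:.1f}%")
```

In this sketch, replacing `fit_line` with a non-linear learner at either stage would mirror the linear versus non-linear comparison described above. The numbers it prints reflect the synthetic data only and are not comparable to the reported PVEs.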