School of Computer Science, University of Oklahoma, Norman, OK, USA.
Department of Microbiology and Plant Biology, University of Oklahoma, Norman, OK, USA.
J Hum Genet. 2021 Apr;66(4):359-369. doi: 10.1038/s10038-020-00832-7. Epub 2020 Oct 2.
Polygenic risk scores (PRS) estimate the genetic risk of an individual for a complex disease based on many genetic variants across the whole genome. In this study, we compared a series of computational models for estimation of breast cancer PRS. A deep neural network (DNN) was found to outperform alternative machine learning techniques and established statistical algorithms, including BLUP, BayesA, and LDpred. In the test cohort with 50% prevalence, the area under the receiver operating characteristic curve (AUC) was 67.4% for DNN, 64.2% for BLUP, 64.5% for BayesA, and 62.4% for LDpred. BLUP, BayesA, and LDpred all generated PRS that followed a normal distribution in the case population. However, the PRS generated by DNN in the case population followed a bimodal distribution composed of two normal distributions with distinctly different means. This suggests that DNN was able to separate the case population into a high-genetic-risk case subpopulation, with an average PRS significantly higher than that of the control population, and a normal-genetic-risk case subpopulation, with an average PRS similar to that of the control population. This allowed DNN to achieve 18.8% recall at 90% precision in the test cohort with 50% prevalence, which can be extrapolated to 65.4% recall at 20% precision in a general population with 12% prevalence. Interpretation of the DNN model identified salient variants that were assigned insignificant p values by association studies but were important for DNN prediction. These variants may be associated with the phenotype through nonlinear relationships.
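The dependence of precision on prevalence in the extrapolation above can be illustrated with a short calculation. The following is a minimal sketch, not the authors' procedure, which the abstract does not describe. It assumes that recall and the false positive rate at a fixed score threshold do not change with prevalence, so that only the case/control mix differs between cohorts. The function names are hypothetical and introduced here for illustration only.

```python
def false_positive_rate(recall, precision, prevalence):
    """False positive rate implied by a (recall, precision) pair at a given prevalence.

    Assumption: each quantity is a fraction in (0, 1).
    """
    true_pos = recall * prevalence                          # true positives per individual screened
    false_pos = true_pos * (1.0 - precision) / precision    # from precision = TP / (TP + FP)
    return false_pos / (1.0 - prevalence)                   # FP divided by the control fraction


def precision_at_prevalence(recall, fpr, prevalence):
    """Precision at a new prevalence, holding recall and FPR fixed (assumed)."""
    true_pos = recall * prevalence
    false_pos = fpr * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Operating point reported for the test cohort: 18.8% recall, 90% precision, 50% prevalence.
fpr_test = false_positive_rate(recall=0.188, precision=0.90, prevalence=0.50)

# The same threshold applied to a population with 12% prevalence (assumed transfer).
prec_general = precision_at_prevalence(recall=0.188, fpr=fpr_test, prevalence=0.12)

# FPR implied by the extrapolated point: 65.4% recall, 20% precision, 12% prevalence.
fpr_general = false_positive_rate(recall=0.654, precision=0.20, prevalence=0.12)

print(f"Implied FPR in test cohort:              {fpr_test:.4f}")
print(f"Precision of same threshold at 12%:      {prec_general:.3f}")
print(f"Implied FPR at 65.4% recall, 20% prec.:  {fpr_general:.3f}")
```

Under these assumptions, the 90%-precision threshold corresponds to a false positive rate of about 2.1%, and its precision would drop to roughly 55% at 12% prevalence. The extrapolated point of 65.4% recall at 20% precision is a different, more permissive threshold, with an implied false positive rate of about 36%.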