Squires Steven, Weedon Michael N, Oram Richard A
University of Exeter, Exeter, United Kingdom.
medRxiv. 2023 Dec 15:2023.12.14.23299972. doi: 10.1101/2023.12.14.23299972.
Polygenic risk scores (PRSs) summarise genetic information into a single number with multiple clinical and research uses. Machine learning (ML) has revolutionised a diverse set of fields; however, its impact on genomics in general, and on PRSs in particular, has been less significant. We explore how ML can improve the generation of PRSs.
We train ML models on known PRSs using UK Biobank data. We explore whether the models can recreate human-programmed PRSs, including whether a single model can generate multiple PRSs, and how difficult it is to use ML for PRS generation. We also investigate how ML can compensate for missing data and what constrains performance.
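As a rough illustration of what "training a model on a known PRS" means, the sketch below treats a PRS as a weighted sum of allele dosages and fits a model to reproduce it from genotypes alone. Everything here is an assumption for illustration: the genotypes are synthetic rather than UK Biobank data, the SNP count and weights are made up, and a linear model trained by stochastic gradient descent stands in for the neural networks used in the study.

```python
import random

random.seed(0)

N_SNPS = 10  # hypothetical number of SNPs in the score

# Hypothetical per-SNP effect weights; a real PRS takes these from published studies.
weights = [random.gauss(0.0, 0.5) for _ in range(N_SNPS)]


def prs(dosages):
    """Reference PRS: weighted sum of allele dosages (0, 1 or 2 per SNP)."""
    return sum(w * d for w, d in zip(weights, dosages))


def sample_genotype():
    """Draw one synthetic genotype with independent, uniformly random dosages."""
    return [random.choice((0, 1, 2)) for _ in range(N_SNPS)]


train = [sample_genotype() for _ in range(500)]
test = [sample_genotype() for _ in range(200)]

# Linear stand-in for the ML model, fitted with plain SGD on squared error.
coef = [0.0] * N_SNPS
bias = 0.0
learning_rate = 0.01

for _ in range(100):
    random.shuffle(train)
    for x in train:
        err = bias + sum(c * d for c, d in zip(coef, x)) - prs(x)
        bias -= learning_rate * err
        coef = [c - learning_rate * err * d for c, d in zip(coef, x)]


def predict(dosages):
    """Model's estimate of the PRS for one genotype."""
    return bias + sum(c * d for c, d in zip(coef, dosages))


def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5


r = pearson([predict(x) for x in test], [prs(x) for x in test])
print(f"held-out correlation between model output and reference PRS: {r:.4f}")
```

Because the synthetic target is exactly linear in the inputs, near-perfect recovery is expected here; the sketch says nothing about how hard the task is on real genotype data.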
We demonstrate almost perfect generation of PRSs, including when using one model to predict multiple scores, with little loss of performance when the quantity of training data is reduced. For an example set of missing single nucleotide polymorphisms (SNPs), a multilayer perceptron (MLP) produces predictions that separate cases from population samples with an area under the receiver operating characteristic curve of 0.847 (95% CI: 0.828-0.864), compared to 0.798 (95% CI: 0.779-0.818) for the PRS. We provide evidence that input information is the limiting factor for further improvement.
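For readers unfamiliar with the metric, the following sketch shows one way an area under the receiver operating characteristic curve and a 95% confidence interval could be computed for scores that separate cases from population samples. The abstract does not state how its intervals were obtained; the percentile bootstrap used here, the function names, and the synthetic scores are all assumptions.

```python
import random


def auc(case_scores, control_scores):
    """Probability that a random case scores higher than a random control.

    Ties count as half, which makes this equal to the area under the ROC curve.
    """
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def bootstrap_ci(case_scores, control_scores, n_boot=200, alpha=0.05, seed=0):
    """Approximate percentile-bootstrap interval for the AUC.

    Cases and controls are resampled separately, with replacement.
    """
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        cases = rng.choices(case_scores, k=len(case_scores))
        controls = rng.choices(control_scores, k=len(control_scores))
        stats.append(auc(cases, controls))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic scores: cases are shifted upwards relative to the population.
rng = random.Random(1)
cases = [rng.gauss(1.0, 1.0) for _ in range(100)]
population = [rng.gauss(0.0, 1.0) for _ in range(100)]

point = auc(cases, population)
low, high = bootstrap_ci(cases, population)
print(f"AUC = {point:.3f} (95% CI: {low:.3f}-{high:.3f})")
```

Comparing two scores on the same samples, as the abstract does for the MLP and the PRS, would call for a paired procedure (resampling the same individuals for both scores); the sketch covers only a single score.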
ML can accurately generate PRSs, including with one model for multiple PRSs. The models are transferable and have high longevity. When certain SNPs are missing, the ML models improve on standard PRS generation by a statistically significant margin. The trained models are probably near the limit of achievable performance, and further improvements likely require additional input data.