Jiang Wei, Chen Ling, Girgenti Matthew J, Zhao Hongyu
Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA.
Department of Statistics, Columbia University, New York, NY, USA.
Res Sq. 2023 May 31:rs.3.rs-2939390. doi: 10.21203/rs.3.rs-2939390/v1.
Predicting genetic risk for common diseases may improve their prevention and early treatment. In recent years, various additive-model-based polygenic risk score (PRS) methods have been proposed to combine the estimated effects of single nucleotide polymorphisms (SNPs) using data collected from genome-wide association studies (GWAS). Some of these methods require access to an external individual-level GWAS dataset to tune their hyperparameters, which can be difficult to obtain because of privacy and security concerns. Additionally, holding out part of the data for hyperparameter tuning can reduce the predictive accuracy of the resulting PRS model. In this article, we propose a novel method, called PRStuning, that automatically tunes hyperparameters for different PRS methods using only GWAS summary statistics from the training data. The core idea is to first predict the performance of the PRS method under different parameter values, and then select the parameters with the best predicted performance. Because directly using the effects observed in the training data tends to overestimate performance in the testing data (a phenomenon known as overfitting), we adopt an empirical Bayes approach that shrinks the predicted performance in accordance with the estimated genetic architecture of the disease. Results from extensive simulations and real data applications demonstrate that PRStuning can accurately predict PRS performance across PRS methods and parameters, and that it can help select the best-performing parameters.
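The tuning idea described in the abstract can be illustrated with a small simulation. The sketch below is not the PRStuning implementation; it is a toy version under strong assumptions that are not taken from the article: independent standardized SNPs (no linkage disequilibrium), a point-normal prior on the true effects fitted by a simple EM algorithm, a z-score thresholding rule as the PRS method being tuned, and the correlation between the score and the phenotype as the performance measure. All function names and parameter values are hypothetical.

```python
import numpy as np

def simulate_gwas(m=20000, n=10000, h2=0.5, pi=0.01, seed=0):
    """Toy summary statistics: independent SNPs, sparse normal true effects."""
    rng = np.random.default_rng(seed)
    causal = rng.random(m) < pi
    beta = np.where(causal, rng.normal(0.0, np.sqrt(h2 / (m * pi)), m), 0.0)
    s2 = 1.0 / n  # assumed sampling variance of each marginal estimate
    beta_hat = beta + rng.normal(0.0, np.sqrt(s2), m)
    return beta, beta_hat, s2

def slab_responsibility(x, s2, pi, tau2):
    """Posterior probability that each SNP has a nonzero effect."""
    v1 = tau2 + s2
    slab = pi * np.exp(-x**2 / (2.0 * v1)) / np.sqrt(v1)
    spike = (1.0 - pi) * np.exp(-x**2 / (2.0 * s2)) / np.sqrt(s2)
    return slab / (slab + spike)

def fit_point_normal(x, s2, n_iter=300):
    """Empirical Bayes: estimate (pi, tau2) of the prior from the data by EM."""
    pi, tau2 = 0.1, 10.0 * np.var(x)
    for _ in range(n_iter):
        resp = slab_responsibility(x, s2, pi, tau2)
        pi = float(np.clip(resp.mean(), 1e-6, 1.0 - 1e-6))
        tau2 = max(float(np.sum(resp * x**2) / np.sum(resp)) - s2, 1e-12)
    return pi, tau2

def posterior_mean(x, s2, pi, tau2):
    """Shrunken effect estimate E[beta | beta_hat] under the fitted prior."""
    return slab_responsibility(x, s2, pi, tau2) * tau2 / (tau2 + s2) * x

def threshold_weights(beta_hat, s2, z_cut):
    """PRS weights that keep only SNPs with |z| above the cutoff."""
    return np.where(np.abs(beta_hat) / np.sqrt(s2) > z_cut, beta_hat, 0.0)

def performance(weights, effects):
    """Score-phenotype correlation implied by `effects` (unit-variance phenotype)."""
    norm = np.sqrt(np.sum(weights**2))
    return float(weights @ effects / norm) if norm > 0 else 0.0

def tune_by_predicted_performance(z_cuts=(0.0, 1.0, 2.0, 3.0, 4.0, 5.0)):
    beta, beta_hat, s2 = simulate_gwas()
    pi_hat, tau2_hat = fit_point_normal(beta_hat, s2)
    shrunk = posterior_mean(beta_hat, s2, pi_hat, tau2_hat)
    naive, predicted, truth = [], [], []
    for z_cut in z_cuts:
        w = threshold_weights(beta_hat, s2, z_cut)
        naive.append(performance(w, beta_hat))    # reuses training effects
        predicted.append(performance(w, shrunk))  # empirical Bayes shrinkage
        truth.append(performance(w, beta))        # known only in simulation
    return {"z_cuts": list(z_cuts), "naive": naive,
            "predicted": predicted, "truth": truth}

if __name__ == "__main__":
    out = tune_by_predicted_performance()
    for row in zip(out["z_cuts"], out["naive"], out["predicted"], out["truth"]):
        print("z_cut=%.1f naive=%.3f predicted=%.3f truth=%.3f" % row)
    best = out["z_cuts"][int(np.argmax(out["predicted"]))]
    print("selected z_cut:", best)
```

In this toy setting the naive estimate plugs the training effects back into the performance formula, so it always favors keeping every SNP, whereas the shrunken estimate replaces them with posterior means and is the quantity used to choose the cutoff. The `truth` column is available only because the effects are simulated; with real summary statistics it would be unknown.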