Chen Ting-Huei, Chatterjee Nilanjan, Landi Maria Teresa, Shi Jianxin
Department of Mathematics and Statistics, Regular member, Cervo Brain Research Centre, University of Laval, 1045, av. of Medicine, Suite 1056, Quebec G1V 0A6, Canada.
Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, 615 N Wolfe Street, Baltimore, MD 21205, United States of America.
J Am Stat Assoc. 2021;116(533):133-143. doi: 10.1080/01621459.2020.1764849. Epub 2020 Oct 12.
Large-scale genome-wide association studies (GWAS) provide opportunities for developing genetic risk prediction models that have the potential to improve disease prevention, intervention, or treatment. The key step is to develop polygenic risk score (PRS) models with high predictive performance for a given disease, which typically requires a large training data set to select truly associated single nucleotide polymorphisms (SNPs) and estimate their effect sizes accurately. Here, we develop a comprehensive penalized regression framework for fitting regularized regression models to GWAS summary statistics. We propose incorporating Pleiotropy and ANnotation information into PRS development (PANPRS) through suitable formulation of penalty functions and associated tuning parameters. Extensive simulations show that PANPRS performs as well as or better than existing PRS methods when no functional annotation or pleiotropy is incorporated. When functional annotation data and pleiotropy are informative, PANPRS substantially outperforms existing PRS methods in simulations. Finally, we applied our methods to build PRS for type 2 diabetes and melanoma and found that incorporating relevant functional annotations and GWAS of genetically related traits improved prediction of these two complex diseases.
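To illustrate what fitting a penalized regression to GWAS summary statistics can look like, the following is a minimal sketch of an L1-penalized fit by coordinate descent that needs only marginal SNP-trait correlations and a linkage disequilibrium (LD) matrix. It is not the PANPRS algorithm: the abstract does not specify the penalty functions, so the objective, the per-SNP penalty vector `lam` (a hypothetical stand-in for annotation-informed tuning), and the function names are all assumptions made for illustration.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator used in L1-penalized coordinate descent."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def summary_stat_lasso(r, R, lam, n_iter=100, tol=1e-8):
    """Minimize  b'Rb - 2 b'r + 2 * sum_j lam[j] * |b_j|  by coordinate descent.

    r   : marginal SNP-trait correlations (derived from GWAS summary statistics)
    R   : LD (SNP-SNP correlation) matrix as a list of lists, e.g. from a
          reference panel
    lam : per-SNP penalties; giving a SNP a smaller penalty is an assumed,
          illustrative way to encode prior support such as an annotation
    """
    p = len(r)
    beta = [0.0] * p
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            # Correlation left for SNP j after removing the other SNPs' effects.
            partial = r[j] - sum(R[j][k] * beta[k] for k in range(p) if k != j)
            new_bj = soft_threshold(partial, lam[j]) / R[j][j]
            max_change = max(max_change, abs(new_bj - beta[j]))
            beta[j] = new_bj
        if max_change < tol:
            break
    return beta


if __name__ == "__main__":
    # Toy inputs (made up): three SNPs, the first two in moderate LD.
    r = [0.30, 0.20, -0.05]
    R = [[1.0, 0.5, 0.0],
         [0.5, 1.0, 0.0],
         [0.0, 0.0, 1.0]]
    print(summary_stat_lasso(r, R, lam=[0.1, 0.1, 0.1]))
    # Lower penalty on the third SNP, as if an annotation supported it.
    print(summary_stat_lasso(r, R, lam=[0.1, 0.1, 0.01]))
```

In this sketch a uniform penalty shrinks all SNPs equally, while lowering `lam[j]` for selected SNPs lets them enter the score with weaker marginal evidence. Incorporating a second, genetically related trait would require a joint penalty across traits, which is not attempted here.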