Vilhjálmsson Bjarni J, Yang Jian, Finucane Hilary K, Gusev Alexander, Lindström Sara, Ripke Stephan, Genovese Giulio, Loh Po-Ru, Bhatia Gaurav, Do Ron, Hayeck Tristan, Won Hong-Hee, Kathiresan Sekar, Pato Michele, Pato Carlos, Tamimi Rulla, Stahl Eli, Zaitlen Noah, Pasaniuc Bogdan, Belbin Gillian, Kenny Eimear E, Schierup Mikkel H, De Jager Philip, Patsopoulos Nikolaos A, McCarroll Steve, Daly Mark, Purcell Shaun, Chasman Daniel, Neale Benjamin, Goddard Michael, Visscher Peter M, Kraft Peter, Patterson Nick, Price Alkes L
Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA; Program in Genetic Epidemiology and Statistical Genetics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA; Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Bioinformatics Research Centre, Aarhus University, 8000 Aarhus, Denmark.
Queensland Brain Institute, University of Queensland, Brisbane, 4072 QLD, Australia; Diamantina Institute, Translational Research Institute, University of Queensland, Brisbane, 4101 QLD, Australia.
Am J Hum Genet. 2015 Oct 1;97(4):576-92. doi: 10.1016/j.ajhg.2015.09.001.
Polygenic risk scores have shown great promise in predicting complex disease risk and will become more accurate as training sample sizes increase. The standard approach for calculating risk scores involves linkage disequilibrium (LD)-based marker pruning and applying a p value threshold to association statistics, but this discards information and can reduce predictive accuracy. We introduce LDpred, a method that infers the posterior mean effect size of each marker by using a prior on effect sizes and LD information from an external reference panel. Theory and simulations show that LDpred outperforms the approach of pruning followed by thresholding, particularly at large sample sizes. Accordingly, predicted R² increased from 20.1% to 25.3% in a large schizophrenia dataset and from 9.8% to 12.0% in a large multiple sclerosis dataset. A similar relative improvement in accuracy was observed for three additional large disease datasets and for non-European schizophrenia samples. The advantage of LDpred over existing methods will grow as sample sizes increase.
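To make the idea of a posterior mean effect size concrete, the following is a minimal sketch of one special case only: a Gaussian (infinitesimal) prior in which every marker is causal. Under that assumption, with standardized genotypes, the posterior mean is approximately (D + M/(N h²) I)⁻¹ β̃, where β̃ holds the marginal effect estimates, D is the LD (correlation) matrix from a reference panel, M is the number of markers, N is the training sample size, and h² is the heritability. The function names, the toy inputs, and the treatment of a single small window as if it were the whole genome are assumptions made for illustration; this is not the LDpred software or its full method.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination
    with partial pivoting (pure Python, small systems only)."""
    n = len(b)
    aug = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Move the row with the largest pivot into place.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def posterior_mean_inf(beta_marginal, ld, n_samples, h2):
    """Approximate posterior mean effects under a Gaussian prior.

    Hypothetical helper: solves (D + M/(N*h2) * I) beta = beta_marginal.
    Assumes standardized genotypes and that the markers passed in are
    all the markers being modeled (M = len(beta_marginal)).
    """
    m = len(beta_marginal)
    shrink = m / (n_samples * h2)
    a = [[ld[i][j] + (shrink if i == j else 0.0) for j in range(m)]
         for i in range(m)]
    return solve(a, beta_marginal)


# Toy usage: two markers in LD (r = 0.5) with equal marginal estimates.
ld_matrix = [[1.0, 0.5],
             [0.5, 1.0]]
marginal = [0.3, 0.3]
adjusted = posterior_mean_inf(marginal, ld_matrix, n_samples=4, h2=0.5)
print(adjusted)
```

When the LD matrix is the identity, the solve reduces to uniform shrinkage of each marginal estimate by 1 / (1 + M/(N h²)). With correlated markers, the off-diagonal terms keep a shared signal from being counted twice, which is the information that pruning would otherwise discard.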