Spiliopoulou Athina, Nagy Reka, Bermingham Mairead L, Huffman Jennifer E, Hayward Caroline, Vitart Veronique, Rudan Igor, Campbell Harry, Wright Alan F, Wilson James F, Pong-Wong Ricardo, Agakov Felix, Navarro Pau, Haley Chris S
MRC Human Genetics Unit, Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh EH4 2XU, UK, Pharmatics Limited, Edinburgh EH16 4UX, UK.
MRC Human Genetics Unit, Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh EH4 2XU, UK.
Hum Mol Genet. 2015 Jul 15;24(14):4167-82. doi: 10.1093/hmg/ddv145. Epub 2015 Apr 26.
We explore the prediction of individuals' phenotypes for complex traits using genomic data. We compare several widely used prediction models, including Ridge Regression, LASSO and Elastic Nets estimated from cohort data, and polygenic risk scores constructed using published summary statistics from genome-wide association meta-analyses (GWAMA). We evaluate the interplay between relatedness, trait architecture and optimal marker density by predicting height, body mass index (BMI) and high-density lipoprotein level (HDL) in two cohorts, one from Croatia and one from Scotland. We empirically demonstrate that dense models are better when all genetic effects are small (height and BMI) and target individuals are related to the training samples, while sparse models predict better in unrelated individuals and when some effects have moderate size (HDL). For HDL, sparse models achieved good across-cohort prediction, performing similarly to the GWAMA risk score and to models trained within the same cohort. This indicates that, for predicting traits with moderately sized effects, large sample sizes and familial structure become less important, though still potentially useful. Finally, we propose a novel ensemble of whole-genome predictors with GWAMA risk scores and demonstrate that the resulting meta-model achieves higher prediction accuracy than either model on its own. We conclude that although current genomic predictors are not accurate enough for diagnostic purposes, performance can be improved without requiring access to large-scale individual-level data. Our methodologically simple meta-model is a means of performing predictive meta-analysis for optimizing genomic predictions and can be easily extended to incorporate multiple population-level summary statistics or other domain knowledge.
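The abstract does not say how the ensemble is fitted, so the following is only a minimal sketch under one assumption: the meta-model is a linear combination of a whole-genome predictor's output and a GWAMA risk score, with weights estimated by ordinary least squares on samples held out from training. The function names (`polygenic_risk_score`, `fit_meta_model`, `predict_meta`) and the toy numbers are hypothetical and do not come from the paper.

```python
from statistics import fmean


def polygenic_risk_score(genotypes, effect_sizes):
    """Weighted allele count: per person, sum of dosage * published effect size."""
    return [sum(g * b for g, b in zip(person, effect_sizes)) for person in genotypes]


def fit_meta_model(whole_genome_pred, risk_score, phenotype):
    """Least-squares fit of phenotype ~ b0 + b1*whole_genome_pred + b2*risk_score.

    Assumption: a plain linear stack; the paper's exact ensemble may differ.
    """
    ma, mb, my = fmean(whole_genome_pred), fmean(risk_score), fmean(phenotype)
    # Centre everything so the intercept can be recovered afterwards.
    a = [x - ma for x in whole_genome_pred]
    b = [x - mb for x in risk_score]
    y = [x - my for x in phenotype]
    saa = sum(x * x for x in a)
    sbb = sum(x * x for x in b)
    sab = sum(p * q for p, q in zip(a, b))
    say = sum(p * q for p, q in zip(a, y))
    sby = sum(p * q for p, q in zip(b, y))
    # Solve the 2x2 normal equations with Cramer's rule.
    det = saa * sbb - sab * sab
    if det == 0:
        raise ValueError("predictors are collinear; weights are not identifiable")
    b1 = (say * sbb - sby * sab) / det
    b2 = (sby * saa - say * sab) / det
    b0 = my - b1 * ma - b2 * mb
    return b0, b1, b2


def predict_meta(weights, whole_genome_pred, risk_score):
    """Apply fitted meta-model weights to new pairs of predictions."""
    b0, b1, b2 = weights
    return [b0 + b1 * a + b2 * b for a, b in zip(whole_genome_pred, risk_score)]


if __name__ == "__main__":
    # Toy tuning set: outputs of a cohort-trained model and a GWAMA risk score.
    cohort_pred = [0.0, 1.0, 2.0, 3.0]
    gwama_score = [1.0, 0.0, 1.0, 0.0]
    observed = [4.0, 3.0, 8.0, 7.0]
    weights = fit_meta_model(cohort_pred, gwama_score, observed)
    print(weights)
    print(predict_meta(weights, [1.5], [0.5]))
```

In practice the weights would be fitted on individuals not used to train the whole-genome predictor, so that the stacked model is not rewarded for overfitting in the first stage. Because the risk score needs only published effect sizes, this step requires individual-level data from a single cohort.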