Laboratoire TIMC-IMAG, UMR 5525, Univ. Grenoble Alpes, CNRS, La Tronche, France; Department of Economics and Business Economics, National Centre for Register-Based Research, Aarhus University, Aarhus, Denmark.
Department of Economics and Business Economics, National Centre for Register-Based Research, Aarhus University, Aarhus, Denmark.
Am J Hum Genet. 2019 Dec 5;105(6):1213-1221. doi: 10.1016/j.ajhg.2019.11.001. Epub 2019 Nov 21.
Polygenic prediction has the potential to contribute to precision medicine. Clumping and thresholding (C+T) is a widely used method to derive polygenic scores. When using C+T, several p value thresholds are tested to maximize the predictive ability of the derived polygenic scores. Along with this p value threshold, we propose to tune three other hyper-parameters for C+T. We implement an efficient way to derive thousands of different C+T scores corresponding to a grid over four hyper-parameters. For example, it takes a few hours to derive 123K different C+T scores for 300K individuals and 1M variants using 16 physical cores. We find that optimizing over these four hyper-parameters improves the predictive performance of C+T in both simulations and real data applications, compared with tuning only the p value threshold. A particularly large increase can be noted when predicting depression status, from an AUC of 0.557 (95% CI: [0.544-0.569]) when tuning only the p value threshold to an AUC of 0.592 (95% CI: [0.580-0.604]) when tuning all four hyper-parameters we propose for C+T. We further propose stacked clumping and thresholding (SCT), a polygenic score that results from stacking all derived C+T scores. Instead of choosing one set of hyper-parameters that maximizes prediction in some training set, SCT learns an optimal linear combination of all C+T scores by using an efficient penalized regression. We apply SCT to eight different case-control diseases in the UK Biobank data and find that SCT substantially improves prediction accuracy, with an average AUC increase of 0.035 over standard C+T.
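The two ideas in the abstract, a grid of C+T scores and a stacked combination of them, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. Everything in it is an assumption made for illustration: the function names (clump, ct_scores, stack), the toy genotype matrix, the simulated effect sizes and p values, the choice to vary only two hyper-parameters (the clumping r2 threshold and the p value threshold) rather than four, and the use of a ridge-penalized logistic regression fitted by plain gradient descent as a stand-in for the penalized regression mentioned above.

```python
import numpy as np

def clump(G, pvals, r2_max, window):
    """Greedy clumping (illustrative): visit variants from most to least
    significant, keep each one not yet removed, and remove neighbours within
    `window` positions whose squared correlation with it exceeds r2_max."""
    n_var = G.shape[1]
    keep = np.zeros(n_var, dtype=bool)
    removed = np.zeros(n_var, dtype=bool)
    for j in np.argsort(pvals):
        if removed[j]:
            continue
        keep[j] = True
        lo, hi = max(0, j - window), min(n_var, j + window + 1)
        for k in range(lo, hi):
            if k != j and not keep[k] and not removed[k]:
                r = np.corrcoef(G[:, j], G[:, k])[0, 1]
                if r * r > r2_max:
                    removed[k] = True
    return keep

def ct_scores(G, betas, pvals, r2_grid, p_grid, window=50):
    """One C+T score per (r2_max, p_max) grid point.
    Returns an (individuals x grid points) matrix and the grid itself."""
    scores, params = [], []
    for r2_max in r2_grid:
        kept = clump(G, pvals, r2_max, window)
        for p_max in p_grid:
            use = kept & (pvals <= p_max)
            scores.append(G[:, use] @ betas[use])
            params.append((r2_max, p_max))
    return np.column_stack(scores), params

def stack(S, y, lam=0.1, lr=0.1, n_iter=500):
    """Stacking stand-in: ridge-penalized logistic regression on the
    standardized C+T scores, fitted by gradient descent."""
    mu, sd = S.mean(axis=0), S.std(axis=0)
    sd[sd == 0] = 1.0                      # guard against constant scores
    Z = (S - mu) / sd
    w, b = np.zeros(Z.shape[1]), 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Z @ w + b)))
        w -= lr * (Z.T @ (p - y) / len(y) + lam * w)
        b -= lr * np.mean(p - y)
    return w, b, mu, sd

# Toy data (entirely simulated; stands in for genotypes and GWAS summary statistics).
rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(200, 30)).astype(float)   # allele counts 0/1/2
betas = rng.normal(0.0, 0.2, size=30)                  # hypothetical effect sizes
pvals = rng.uniform(0.0, 1.0, size=30)                 # hypothetical p values
y = (rng.uniform(size=200) < 1.0 / (1.0 + np.exp(-(G @ betas - (G @ betas).mean())))).astype(float)

S, grid = ct_scores(G, betas, pvals, r2_grid=[0.2, 0.8], p_grid=[0.05, 0.5, 1.0])
w, b, mu, sd = stack(S, y)
sct_score = ((S - mu) / sd) @ w + b    # stacked score: a linear combination of all C+T scores
print(S.shape, np.round(w, 3))
```

In standard C+T one would pick the single column of S that predicts best in a training set; the stacked score instead uses a weighted combination of every column. In practice the weights should be learned on a training set and evaluated on held-out individuals, which this toy example does not do.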