Making the Most of Clumping and Thresholding for Polygenic Scores.

Affiliations

Laboratoire TIMC-IMAG, UMR 5525, Univ. Grenoble Alpes, CNRS, La Tronche, France; Department of Economics and Business Economics, National Centre for Register-Based Research, Aarhus University, Aarhus, Denmark.

Department of Economics and Business Economics, National Centre for Register-Based Research, Aarhus University, Aarhus, Denmark.

Publication information

Am J Hum Genet. 2019 Dec 5;105(6):1213-1221. doi: 10.1016/j.ajhg.2019.11.001. Epub 2019 Nov 21.

DOI:10.1016/j.ajhg.2019.11.001

PMID:31761295

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC6904799/

Abstract

Polygenic prediction has the potential to contribute to precision medicine. Clumping and thresholding (C+T) is a widely used method to derive polygenic scores. When using C+T, several p value thresholds are tested to maximize predictive ability of the derived polygenic scores. Along with this p value threshold, we propose to tune three other hyper-parameters for C+T. We implement an efficient way to derive thousands of different C+T scores corresponding to a grid over four hyper-parameters. For example, it takes a few hours to derive 123K different C+T scores for 300K individuals and 1M variants using 16 physical cores. We find that optimizing over these four hyper-parameters improves the predictive performance of C+T in both simulations and real data applications as compared to tuning only the p value threshold. A particularly large increase can be noted when predicting depression status, from an AUC of 0.557 (95% CI: [0.544-0.569]) when tuning only the p value threshold to an AUC of 0.592 (95% CI: [0.580-0.604]) when tuning all four hyper-parameters we propose for C+T. We further propose stacked clumping and thresholding (SCT), a polygenic score that results from stacking all derived C+T scores. Instead of choosing one set of hyper-parameters that maximizes prediction in some training set, SCT learns an optimal linear combination of all C+T scores by using an efficient penalized regression. We apply SCT to eight different case-control diseases in the UK Biobank data and find that SCT substantially improves prediction accuracy with an average AUC increase of 0.035 over standard C+T.
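
The workflow described in the abstract (derive a grid of C+T scores, pick the best one in a training set, then stack all of them) can be illustrated with a small NumPy sketch on simulated data. This is not the authors' implementation, and several choices are assumptions made only for illustration: the abstract does not name the three additional hyper-parameters, so the grid here covers just the p value threshold plus a clumping r² threshold and a clumping window size; the trait is continuous and evaluated with squared correlation rather than a case-control status evaluated with AUC; and plain ridge regression stands in for the penalized regression used by SCT. All sizes, effect sizes, grid values, and function names are hypothetical.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Simulated genotypes with local correlation (LD-like): an AR(1) latent
# Gaussian is cut at -0.5 and 0.5 to give values in {0, 1, 2}.
n, m, rho = 1500, 200, 0.8          # assumed sizes, illustration only
Z = np.empty((n, m))
Z[:, 0] = rng.standard_normal(n)
for j in range(1, m):
    Z[:, j] = rho * Z[:, j - 1] + math.sqrt(1 - rho**2) * rng.standard_normal(n)
G = (Z > -0.5).astype(float) + (Z > 0.5).astype(float)

# Continuous trait driven by 10 causal variants (assumed effect sizes).
true_beta = np.zeros(m)
true_beta[rng.choice(m, size=10, replace=False)] = rng.normal(0, 0.3, size=10)
y = G @ true_beta + rng.standard_normal(n)

# Disjoint sets: summary statistics, hyper-parameter tuning, evaluation.
gwas, train, test = np.split(rng.permutation(n), [700, 1100])

def marginal_gwas(G, y):
    """Per-variant simple regression: effect sizes and approximate p values."""
    Gc, yc = G - G.mean(axis=0), y - y.mean()
    ss = (Gc**2).sum(axis=0)
    beta = Gc.T @ yc / ss
    r = Gc.T @ yc / np.sqrt(ss * (yc**2).sum())
    t = r * math.sqrt(G.shape[0] - 2) / np.sqrt(1 - r**2)
    p = np.array([math.erfc(abs(v) / math.sqrt(2)) for v in t])  # normal approx.
    return beta, p

def clump(G, pvals, r2_thr, window):
    """Greedy clumping: visit variants from most to least significant and keep
    one only if its r^2 with every kept variant within `window` positions
    does not exceed r2_thr."""
    Gs = (G - G.mean(axis=0)) / G.std(axis=0)
    kept = []
    for j in np.argsort(pvals):
        j = int(j)
        if all(abs(j - k) > window or (Gs[:, j] @ Gs[:, k] / G.shape[0]) ** 2 <= r2_thr
               for k in kept):
            kept.append(j)
    return np.array(sorted(kept))

def r2(a, b):
    """Squared Pearson correlation; 0 if either vector is constant."""
    if a.std() == 0 or b.std() == 0:
        return 0.0
    return float(np.corrcoef(a, b)[0, 1] ** 2)

betas, pvals = marginal_gwas(G[gwas], y[gwas])

# Grid of C+T scores: one column per (r2_thr, window, p_thr) combination.
r2_grid, window_grid, p_grid = [0.05, 0.2, 0.8], [5, 50], [1.0, 0.1, 1e-2, 1e-4]
grid, columns = [], []
for r2_thr in r2_grid:
    for window in window_grid:
        kept = clump(G[gwas], pvals, r2_thr, window)
        for p_thr in p_grid:
            sel = kept[pvals[kept] <= p_thr]      # thresholding step
            grid.append((r2_thr, window, p_thr))
            columns.append(G[:, sel] @ betas[sel])
S = np.column_stack(columns)

# Option 1: keep the single grid point that predicts best in the training set.
best = int(np.argmax([r2(S[train, k], y[train]) for k in range(S.shape[1])]))

# Option 2: stack all scores with ridge regression (stand-in for SCT).
mu, sd = S[train].mean(axis=0), S[train].std(axis=0)
sd[sd == 0] = 1.0
X_train, X_test = (S[train] - mu) / sd, (S[test] - mu) / sd
lam = 10.0                                        # assumed penalty strength
w = np.linalg.solve(X_train.T @ X_train + lam * np.eye(S.shape[1]),
                    X_train.T @ (y[train] - y[train].mean()))
stacked_test = X_test @ w

print("best grid point:", grid[best])
print("test r2, best single C+T score:", r2(S[test, best], y[test]))
print("test r2, stacked scores:", r2(stacked_test, y[test]))
```

The three-way split matters: effect sizes and p values come from one set, the grid point or stacking weights are chosen in a second, and performance is reported on a third, so the tuning step does not inflate the reported accuracy.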
