Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA.
Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA 90095, USA.
Am J Hum Genet. 2023 Aug 3;110(8):1319-1329. doi: 10.1016/j.ajhg.2023.06.015. Epub 2023 Jul 24.
Polygenic scores (PGSs) have emerged as a standard approach to predict phenotypes from genotype data in a wide array of applications, from socio-genomics to personalized medicine. Traditional PGSs assume genotype data to be error-free, ignoring possible errors and uncertainties introduced by genotyping, sequencing, and/or imputation. In this work, we investigate the effects of genotyping error due to low-coverage sequencing on PGS estimation. We leverage SNP array and low-coverage whole-genome sequencing data (lcWGS, median coverage 0.04×) of 802 individuals from the Dana-Farber PROFILE cohort to show that PGS error correlates with sequencing depth (p = 1.2 × 10). We develop a probabilistic approach that incorporates genotype error in PGS estimation to produce well-calibrated PGS credible intervals and show that the probabilistic approach increases classification accuracy by up to 6% compared with traditional PGSs that ignore genotyping error. Finally, we use simulations to explore the combined effect of genotyping and effect-size errors and their implications for PGS-based risk stratification. Our results illustrate the importance of considering genotyping error as a source of PGS error, especially for cohorts with varying genotyping technologies and/or low-coverage sequencing.
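To make the idea of propagating genotype uncertainty into a PGS concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each SNP comes with posterior genotype probabilities (P(g=0), P(g=1), P(g=2)), that genotypes are independent across SNPs, and that effect sizes are known exactly; the function names `pgs_moments` and `pgs_credible_interval` are hypothetical. Under those assumptions the PGS mean and variance have closed forms, and an interval can be approximated by Monte Carlo sampling of genotypes.

```python
import random

def pgs_moments(probs, betas):
    """Mean and variance of PGS = sum_j beta_j * g_j.

    probs: list of (P(g=0), P(g=1), P(g=2)) per SNP (assumed input format).
    betas: list of per-SNP effect sizes, treated as fixed (assumption).
    Genotypes are assumed independent across SNPs.
    """
    mean = 0.0
    var = 0.0
    for (p0, p1, p2), beta in zip(probs, betas):
        dosage = p1 + 2.0 * p2        # E[g]
        second = p1 + 4.0 * p2        # E[g^2]
        mean += beta * dosage
        var += beta ** 2 * (second - dosage ** 2)
    return mean, var

def pgs_credible_interval(probs, betas, level=0.95, n_draws=10000, seed=0):
    """Approximate an equal-tailed interval by sampling genotypes per SNP."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        score = 0.0
        for p, beta in zip(probs, betas):
            g = rng.choices((0, 1, 2), weights=p)[0]  # one sampled genotype
            score += beta * g
        draws.append(score)
    draws.sort()
    lo = draws[int((1 - level) / 2 * (n_draws - 1))]
    hi = draws[int((1 + level) / 2 * (n_draws - 1))]
    return lo, hi

# Toy input: three SNPs with made-up probabilities and effect sizes.
probs = [(0.9, 0.1, 0.0), (0.2, 0.5, 0.3), (0.0, 0.0, 1.0)]
betas = [0.5, -0.2, 0.1]
mean, var = pgs_moments(probs, betas)
lo, hi = pgs_credible_interval(probs, betas)
```

A traditional PGS would plug in a single hard-called genotype per SNP and report only a point estimate; in this sketch, a SNP with a confident call (such as the third one above) contributes no variance, while uncertain calls widen the interval.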