Why are rare variants hard to impute? Coalescent models reveal theoretical limits in existing algorithms.

Author affiliations

Department of Biostatistics, School of Public Health, University of Michigan, 1420 Washington Heights, Ann Arbor, MI 48109, USA.

Department of Psychiatry, University of Michigan, 1420 Washington Heights, Ann Arbor, MI 48109, USA.

Publication information

Genetics. 2021 Apr 15;217(4). doi: 10.1093/genetics/iyab011.

DOI: 10.1093/genetics/iyab011

PMID: 33686438

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8049559/

Abstract

Genotype imputation is an indispensable step in human genetic studies. Large reference panels with deeply sequenced genomes now allow interrogating variants with minor allele frequency < 1% without sequencing. Although it is critical to consider the limits of this approach, imputation methods for rare variants have only done so empirically; the theoretical basis of their imputation accuracy has not been explored. To provide theoretical consideration of imputation accuracy under the current imputation framework, we develop a coalescent model of imputing rare variants, leveraging the joint genealogy of the sample to be imputed and reference individuals. We show that broadly used imputation algorithms include model misspecifications about this joint genealogy that limit the ability to correctly impute rare variants. We develop closed-form solutions for the probability distribution of this joint genealogy and quantify the inevitable error rate resulting from the model misspecification across a range of allele frequencies and reference sample sizes. We show that the probability of a falsely imputed minor allele decreases with reference sample size, but the proportion of falsely imputed minor alleles mostly depends on the allele count in the reference sample. We summarize the impact of this imputation error on association tests by calculating the r² between imputed and true genotypes and show that, even when other sources of error are modeled, the model misspecification has a significant impact on the r² of rare variants. To evaluate these predictions in practice, we compare the imputation of the same dataset across imputation panels of different sizes. Although this empirical imputation accuracy is substantially lower than our theoretical prediction, model misspecification seems to further decrease imputation accuracy for variants with low allele counts in the reference. These results provide a framework for developing new imputation algorithms and for interpreting rare variant association analyses.
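
As an illustration of the r² metric mentioned in the abstract, the following minimal sketch computes the squared Pearson correlation between true genotypes and imputed dosages at a single variant. It is not the authors' code: the function name `squared_correlation` and the toy data are assumptions made for this example, and the sketch uses only the standard definition of r².

```python
def squared_correlation(true_genotypes, imputed_dosages):
    """Squared Pearson correlation (r^2) between two equal-length sequences.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    n = len(true_genotypes)
    if n == 0 or n != len(imputed_dosages):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_t = sum(true_genotypes) / n
    mean_d = sum(imputed_dosages) / n
    cov = sum((t - mean_t) * (d - mean_d)
              for t, d in zip(true_genotypes, imputed_dosages))
    var_t = sum((t - mean_t) ** 2 for t in true_genotypes)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosages)
    if var_t == 0 or var_d == 0:
        # r^2 is undefined when either vector has no variation
        return float("nan")
    return cov * cov / (var_t * var_d)


# Toy data (assumed): 100 haplotypes, 4 of which carry the minor allele.
true_alleles = [0] * 96 + [1] * 4

# Imputation that misses one true carrier and adds one false minor allele.
imputed_alleles = list(true_alleles)
imputed_alleles[99] = 0  # missed carrier
imputed_alleles[0] = 1   # falsely imputed minor allele

r2_perfect = squared_correlation(true_alleles, true_alleles)
r2_with_errors = squared_correlation(true_alleles, imputed_alleles)
print(r2_perfect, r2_with_errors)
```

In this toy case only two of 100 alleles are wrong, yet r² falls well below 1 because the variant is rare. This is consistent with the abstract's point that the proportion of falsely imputed minor alleles matters more for rare variants than the overall error probability.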

