Department of Biostatistics, School of Public Health, University of Michigan, 1420 Washington Heights, Ann Arbor, MI 48109, USA.
Department of Psychiatry, University of Michigan, 1420 Washington Heights, Ann Arbor, MI 48109, USA.
Genetics. 2021 Apr 15;217(4). doi: 10.1093/genetics/iyab011.
Genotype imputation is an indispensable step in human genetic studies. Large reference panels of deeply sequenced genomes now make it possible to interrogate variants with minor allele frequency < 1% without sequencing. Although it is critical to consider the limits of this approach, the accuracy of imputation methods for rare variants has only been assessed empirically; its theoretical basis has not been explored. To provide a theoretical treatment of imputation accuracy under the current imputation framework, we develop a coalescent model of imputing rare variants that leverages the joint genealogy of the sample to be imputed and the reference individuals. We show that widely used imputation algorithms include model misspecifications about this joint genealogy that limit their ability to correctly impute rare variants. We derive closed-form solutions for the probability distribution of this joint genealogy and quantify the inevitable error rate resulting from the model misspecification across a range of allele frequencies and reference sample sizes. We show that the probability of a falsely imputed minor allele decreases with reference sample size, but the proportion of falsely imputed minor alleles mostly depends on the allele count in the reference sample. We summarize the impact of this error on association tests by calculating the r² between imputed and true genotypes and show that, even when other sources of error are modeled, the model misspecification has a significant impact on the r² of rare variants. To evaluate these predictions in practice, we compare the imputation of the same dataset across imputation panels of different sizes. Although this empirical imputation accuracy is substantially lower than our theoretical prediction, model misspecification appears to further decrease imputation accuracy for variants with low allele counts in the reference. These results provide a framework for developing new imputation algorithms and for interpreting rare variant association analyses.
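As an illustration of the accuracy summary mentioned above, the squared correlation between imputed and true genotypes can be computed as in the following minimal sketch. This is not the authors' code: the function name `imputation_r2` and the toy genotype vectors are hypothetical, and the sketch assumes genotypes are coded as minor-allele counts or dosages.

```python
def imputation_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between true and imputed genotypes.

    Hypothetical helper for illustration only. Inputs are equal-length
    sequences of minor-allele counts (or dosages), one entry per sample.
    """
    n = len(true_genotypes)
    if n == 0 or n != len(imputed_dosages):
        raise ValueError("inputs must be non-empty and of equal length")

    mean_g = sum(true_genotypes) / n
    mean_d = sum(imputed_dosages) / n

    # Population covariance and variances (the 1/n factors cancel in r^2).
    cov = sum((g - mean_g) * (d - mean_d)
              for g, d in zip(true_genotypes, imputed_dosages)) / n
    var_g = sum((g - mean_g) ** 2 for g in true_genotypes) / n
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosages) / n

    # r^2 is undefined when either vector is constant (monomorphic site).
    if var_g == 0 or var_d == 0:
        return float("nan")
    return cov * cov / (var_g * var_d)


# Toy data (made up): 8 haplotypes, 2 of which truly carry the minor allele.
true_g = [0, 0, 0, 1, 0, 0, 1, 0]
# Same calls plus one falsely imputed minor allele at index 1.
one_false_positive = [0, 1, 0, 1, 0, 0, 1, 0]

r2_perfect = imputation_r2(true_g, true_g)
r2_with_error = imputation_r2(true_g, one_false_positive)
```

In this toy example a single falsely imputed minor allele lowers the r² well below 1 because the true allele count is small, which matches the qualitative point that errors weigh most heavily on variants with low allele counts.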