Center for Research on Genomics and Global Health, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892-5635, USA.
Genet Epidemiol. 2010 Apr;34(3):258-65. doi: 10.1002/gepi.20457.
Imputation of genotypes for markers untyped in a study sample has become a standard approach to increase genome coverage in genome-wide association studies at practically zero cost. Most methods for imputing missing genotypes extend previously described algorithms for inferring haplotype phase. These algorithms generally fall into three classes based on the underlying model for estimating the conditional distribution of haplotype frequencies: a cluster-based model, a multinomial model, or a population genetics-based model. We compared BEAGLE, PLINK, and MACH, representing the three classes of models, respectively, with specific attention to measures of imputation success and selection of the reference panel for an admixed study sample of African Americans. Based on analysis of chromosome 22 and after calibration to a fixed level of 90% concordance between experimentally determined and imputed genotypes, MACH yielded the largest absolute number of successfully imputed markers and the largest gain in coverage of the variation captured by HapMap reference panels. Following the common practice of performing imputation once, the Yoruba in Ibadan, Nigeria (YRI) reference panel outperformed other HapMap reference panels, including (1) African ancestry from Southwest USA (ASW) data, (2) an unweighted combination of the Northern and Western Europe (CEU) and YRI data into a single reference panel, and (3) a combination of the CEU and YRI data into a single reference panel with weights matching estimates of admixture proportions. For our admixed study sample, the optimal strategy involved imputing twice with the HapMap CEU and YRI reference panels separately and then merging the data sets.
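Two steps described above lend themselves to a short illustration: measuring concordance between experimentally determined and imputed genotypes, and merging the results of two separate imputation runs (CEU and YRI). The Python sketch below is illustrative only. The genotype encoding, the per-marker quality score, and the merge rule (keep the call with the higher quality score for each marker) are assumptions made for this example; the abstract does not specify how the two data sets were merged, and none of the function names correspond to BEAGLE, PLINK, or MACH.

```python
from typing import Dict, Optional, Tuple

# Hypothetical encoding: an unphased genotype is a two-letter string such as "AG".
Genotype = str
# Hypothetical imputed call: (genotype, quality score between 0 and 1).
Call = Tuple[Genotype, float]


def concordance(observed: Dict[str, Genotype],
                imputed: Dict[str, Genotype]) -> Optional[float]:
    """Fraction of markers present in both sets whose genotypes agree.

    Allele order is ignored, so "AG" and "GA" count as a match.
    Returns None when the two sets share no markers.
    """
    shared = observed.keys() & imputed.keys()
    if not shared:
        return None
    matches = sum(
        1 for marker in shared
        if sorted(observed[marker]) == sorted(imputed[marker])
    )
    return matches / len(shared)


def merge_imputations(ceu: Dict[str, Call],
                      yri: Dict[str, Call]) -> Dict[str, Call]:
    """Merge two imputation runs marker by marker.

    Assumed rule (not taken from the abstract): when both runs impute
    a marker, keep the call with the higher quality score; ties keep
    the YRI call. Markers imputed by only one run are kept as they are.
    """
    merged = dict(yri)
    for marker, call in ceu.items():
        if marker not in merged or call[1] > merged[marker][1]:
            merged[marker] = call
    return merged


if __name__ == "__main__":
    typed = {"rs1": "AG", "rs2": "CC", "rs3": "TT"}
    imputed = {"rs1": "GA", "rs2": "CT", "rs3": "TT", "rs4": "AA"}
    print(concordance(typed, imputed))

    ceu_run = {"rs1": ("AG", 0.80), "rs5": ("CC", 0.95)}
    yri_run = {"rs1": ("AG", 0.90), "rs4": ("AA", 0.70)}
    print(merge_imputations(ceu_run, yri_run))
```

In a real analysis, concordance would be computed across all individuals and masked markers, and a quality threshold would be tuned until the target concordance (90% in the study above) is reached; this sketch handles a single individual for brevity.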