Reprogen - Animal Bioscience, Faculty of Veterinary Science, University of Sydney, 425 Werombi Road, Camden, NSW 2570, Australia.
BMC Genomics. 2012 Oct 8;13:538. doi: 10.1186/1471-2164-13-538.
We investigated strategies for, and factors affecting the accuracy of, imputing genotypes from lower-density SNP panels (Illumina 3K and 7K, Affymetrix 15K and 25K, and evenly spaced subsets) up to one medium-density (Illumina 50K) and one high-density (Illumina 800K) SNP panel. We also evaluated the effect of imputed genotypes on the accuracy of genomic selection, using data from Australian Holstein-Friesian cattle: 2727 animals genotyped with the 50K SNP chip and 845 animals genotyped with the 800K SNP chip. To evaluate imputation accuracy, animals were divided into reference and test sets, genotyped with the higher- and lower-density SNP panels, respectively. To evaluate the accuracy of genomic selection, direct genetic values (DGV) were compared by dividing the data into training and validation sets under a range of imputation scenarios.
Of the three imputation methods compared, IMPUTE2 outperformed Beagle and fastPhase in almost all scenarios. Higher SNP densities in the test animals, larger reference sets and higher relatedness between test and reference animals all increased the accuracy of imputation. 50K-specific genotypes were imputed with moderate allelic error rates from 15K (2.85%) and 25K (2.75%) genotypes. Using IMPUTE2, SNP genotypes up to 800K were imputed with a low allelic error rate (0.79% genome-wide) from 50K genotypes, and with moderate error rates from 3K (4.78%) and 7K (2.00%) genotypes. The error rate of imputing up to 800K from 3K or 7K was further reduced when an additional middle tier of 50K genotypes was incorporated in a 3-tiered framework. Accuracies of DGV for five production traits using imputed 50K genotypes were close to those obtained with the actual 50K genotypes and higher than those obtained with 3K or 7K genotypes. The loss in DGV accuracy was small even when most of the training animals also had imputed (50K) genotypes. Additional gains in DGV accuracy were small when SNP density increased from 50K to imputed 800K.
Population-based genotype imputation can be used to predict and combine genotypes from different low-, medium- and high-density SNP chips with a high level of accuracy. Imputing genotypes from low-density SNP panels up to at least 50K SNP density increases the accuracy of genomic selection.
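The allelic error rates quoted above can be illustrated with a minimal sketch. The abstract does not give the formula, so the definition used here is an assumption: genotypes are coded as counts of one allele (0, 1 or 2), and the error rate is the number of wrongly imputed alleles divided by the total number of alleles compared. The function name `allelic_error_rate` and the toy genotypes are hypothetical and are not taken from the study.

```python
from typing import Optional, Sequence


def allelic_error_rate(
    true_genotypes: Sequence[Optional[int]],
    imputed_genotypes: Sequence[Optional[int]],
) -> float:
    """Fraction of alleles imputed incorrectly (assumed definition).

    Genotypes are allele counts (0, 1 or 2). A genotype that is off by one
    allele (e.g. true 2, imputed 1) contributes one wrong allele out of two;
    a genotype that is off by two contributes two. Pairs with a missing
    value (None) are skipped.
    """
    if len(true_genotypes) != len(imputed_genotypes):
        raise ValueError("genotype vectors must have the same length")

    wrong_alleles = 0
    total_alleles = 0
    for true_g, imputed_g in zip(true_genotypes, imputed_genotypes):
        if true_g is None or imputed_g is None:
            continue  # no comparison possible for this SNP
        wrong_alleles += abs(true_g - imputed_g)
        total_alleles += 2  # diploid: two alleles per genotype

    if total_alleles == 0:
        raise ValueError("no comparable genotypes")
    return wrong_alleles / total_alleles


# Toy example: 8 masked SNPs for one test animal, 2 alleles imputed wrongly.
true_g = [0, 1, 2, 2, 1, 0, 1, 2]
imputed_g = [0, 1, 2, 1, 1, 0, 2, 2]
print(f"{allelic_error_rate(true_g, imputed_g):.3%}")  # 2 wrong of 16 alleles
```

Under this assumed definition, a rate of 0.79% would correspond to fewer than one wrongly imputed allele per hundred alleles compared.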