Wang Yining, Cai Zhipeng, Stothard Paul, Moore Steve, Goebel Randy, Wang Lusheng, Lin Guohui
Department of Computing Science, University of Alberta, Edmonton, Alberta T6G 2E8, Canada.
BMC Res Notes. 2012 Aug 3;5:404. doi: 10.1186/1756-0500-5-404.
Single nucleotide polymorphism (SNP) genotyping assays normally produce a certain percentage of no-calls; the problem becomes severe when the target organism, such as cattle, does not have a high-resolution genomic sequence. Missing SNP genotypes, when related to target traits, can confound downstream data analyses such as genome-wide association studies (GWAS). Existing methods for recovering the missing values are only partly successful: they are either accurate but not fast enough, or fast but not accurate enough.
For a target missing genotype, our local imputation process uses only the SNP loci within a genetic distance vicinity and only the samples within a similarity vicinity. For missing genotype imputation, comparative performance evaluations through extensive simulation studies on real human and cattle genotype datasets demonstrated that our nearest neighbor based local imputation method was one of the most efficient methods and outperformed all existing methods except the time-consuming fastPHASE. For missing haplotype allele imputation, comparative performance evaluations on real mouse haplotype datasets demonstrated that our method was not only one of the most efficient methods but also one of the most accurate.
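The two vicinities described above can be illustrated with a minimal sketch. This is not the authors' implementation: the missing-value code `MISSING`, the function name `impute_one`, the parameters `window` and `k`, the match-fraction similarity, and the majority vote are all assumptions chosen for illustration. The abstract does not specify how similarity is measured or how neighbors are combined.

```python
from collections import Counter

MISSING = -1  # hypothetical code for a no-call


def impute_one(genotypes, positions, sample, locus, window=2.0, k=3):
    """Guess genotypes[sample][locus] from nearby loci and similar samples.

    genotypes: list of rows (one per sample), each a list of genotype codes.
    positions: genetic position of each locus, in the same order as the columns.
    """
    # Distance vicinity: loci close to the target locus, excluding the target.
    nearby = [j for j, pos in enumerate(positions)
              if j != locus and abs(pos - positions[locus]) <= window]

    scored = []
    for other, row in enumerate(genotypes):
        # A neighbor must have a called genotype at the target locus.
        if other == sample or row[locus] == MISSING:
            continue
        # Compare only nearby loci where both samples were called.
        shared = [j for j in nearby
                  if row[j] != MISSING and genotypes[sample][j] != MISSING]
        if not shared:
            continue
        matches = sum(row[j] == genotypes[sample][j] for j in shared)
        scored.append((matches / len(shared), other))

    if not scored:
        return MISSING  # nothing to learn from; leave the no-call in place

    # Similarity vicinity: the k most similar samples (ties: lower index first).
    scored.sort(key=lambda t: (-t[0], t[1]))
    votes = Counter(genotypes[other][locus] for _, other in scored[:k])
    return votes.most_common(1)[0][0]
```

Calling `impute_one(genotypes, positions, sample=0, locus=2)` returns a single genotype code, or `MISSING` when no other sample shares a called locus inside the window. Restricting both loci and samples keeps the work per missing entry small, which is the efficiency argument the abstract makes for a local method.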
Given that fastPHASE requires a long imputation time on medium to high density datasets, and that our nearest neighbor based local imputation method performed only slightly worse than fastPHASE yet better than all other methods, one might want to adopt our method as an alternative for imputing missing SNP genotypes or missing haplotype alleles.