Roberts Adam, McMillan Leonard, Wang Wei, Parker Joel, Rusyn Ivan, Threadgill David
Department of Computer Science, University of North Carolina, Chapel Hill, NC 27599, USA.
Bioinformatics. 2007 Jul 1;23(13):i401-7. doi: 10.1093/bioinformatics/btm220.
Typical high-throughput genotyping techniques produce numerous missing calls that confound subsequent analyses, such as disease association studies. Common remedies include removing the affected markers and/or samples, or imputing the missing data. On small marker sets, imputation is frequently based on a vote of the K-nearest-neighbor (KNN) haplotypes, but this technique is neither practical nor justifiable for large datasets.
We describe a data structure that supports efficient KNN queries over arbitrarily sized, sliding haplotype windows, and we evaluate its use for genotype imputation. The performance of our method enables exhaustive exploration of all window sizes and known sites in large SNP panels (150K and 8.3M SNPs). We also compare the accuracy and performance of our methods with those of competing imputation approaches.
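To make the idea of a windowed KNN vote concrete, the following is a minimal brute-force sketch, not the data structure described above and not NPUTE's implementation. It assumes haplotypes are equal-length strings with `?` marking a missing call; the function names, the `half_window` parameter, and the mismatch-count distance are illustrative assumptions.

```python
from collections import Counter

MISSING = "?"  # assumed marker for a missing call


def window_distance(a, b, site, half_window):
    """Count mismatches between haplotypes a and b in the window centred on
    `site`, skipping the site itself and positions missing in either one."""
    lo = max(0, site - half_window)
    hi = min(len(a), site + half_window + 1)
    return sum(
        1
        for j in range(lo, hi)
        if j != site and MISSING not in (a[j], b[j]) and a[j] != b[j]
    )


def impute_call(haplotypes, sample, site, half_window, k):
    """Impute one missing call by majority vote of the k nearest haplotypes
    that have a known call at `site`."""
    target = haplotypes[sample]
    donors = [
        (window_distance(target, h, site, half_window), i)
        for i, h in enumerate(haplotypes)
        if i != sample and h[site] != MISSING
    ]
    if not donors:
        return MISSING  # no haplotype has a known call at this site
    donors.sort()  # nearest first; ties broken by sample index
    votes = Counter(haplotypes[i][site] for _, i in donors[:k])
    return votes.most_common(1)[0][0]


def impute_all(haplotypes, half_window, k):
    """Return a copy of the panel with every missing call imputed."""
    return [
        "".join(
            impute_call(haplotypes, s, j, half_window, k) if c == MISSING else c
            for j, c in enumerate(h)
        )
        for s, h in enumerate(haplotypes)
    ]


panel = ["00110", "00110", "11001", "0?110", "11?01"]
imputed = impute_all(panel, half_window=2, k=1)
```

Each query here rescans the whole window for every candidate haplotype, so the cost grows with the number of samples, the window size, and the number of missing calls. That cost is what makes exhaustive exploration of window sizes impractical on large panels without a more efficient query structure.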
NPUTE, a free open-source software package, is available for non-commercial use at http://compgen.unc.edu/software.