Zhang Kui, Qin Zhaohui S, Liu Jun S, Chen Ting, Waterman Michael S, Sun Fengzhu
Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, California 90089-1113, USA.
Genome Res. 2004 May;14(5):908-16. doi: 10.1101/gr.1837404. Epub 2004 Apr 12.
Recent studies have revealed that linkage disequilibrium (LD) patterns vary across the human genome, with regions of high LD interspersed with regions of low LD. A small fraction of SNPs (tag SNPs) is sufficient to capture most of the haplotype structure of the human genome. In this paper, we develop a method to partition haplotypes into blocks and to identify tag SNPs from genotype data. The method combines a dynamic programming algorithm for haplotype block partitioning and tag SNP selection, which operates on haplotype data, with a variation of the expectation-maximization (EM) algorithm for haplotype inference. Using extensive simulations, we assess the effects of using either haplotype or genotype data in haplotype block identification and tag SNP selection as a function of several factors, including sample size, density or number of SNPs studied, allele frequencies, fraction of missing data, and genotyping error rate. We find that a modest number of haplotype or genotype samples will result in consistent block partitions and tag SNP selection. The power of association studies based on tag SNPs using genotype data is similar to that using haplotype data.
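To make the haplotype inference step concrete, here is a minimal sketch of the textbook EM algorithm for estimating haplotype frequencies from unphased genotypes. It is not the variation developed in the paper, whose details the abstract does not give. It assumes biallelic SNPs, Hardy-Weinberg equilibrium, and no missing data, and the function names `compatible_pairs` and `em_haplotype_frequencies` are hypothetical.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """List the unordered haplotype pairs consistent with an unphased genotype.

    genotype: tuple with one entry per SNP, giving the number of copies
    (0, 1, or 2) of allele 1. Haplotypes are tuples of 0/1 alleles.
    """
    het = [i for i, g in enumerate(genotype) if g == 1]
    base = [1 if g == 2 else 0 for g in genotype]
    pairs = set()
    # Try every phasing of the heterozygous sites; sorting removes mirror images.
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


def em_haplotype_frequencies(genotypes, n_iter=100, tol=1e-9):
    """Estimate haplotype frequencies by EM (illustrative assumptions only)."""
    expansions = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for pairs in expansions for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform starting point
    for _ in range(n_iter):
        counts = defaultdict(float)
        for pairs in expansions:
            # E-step: posterior probability of each compatible pair under
            # Hardy-Weinberg proportions (factor 2 for two distinct haplotypes).
            weights = [freq[a] * freq[b] * (1 if a == b else 2) for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: expected haplotype counts divided by the 2N chromosomes.
        new_freq = {h: counts[h] / (2 * len(genotypes)) for h in haps}
        delta = max(abs(new_freq[h] - freq[h]) for h in haps)
        freq = new_freq
        if delta < tol:
            break
    return freq


if __name__ == "__main__":
    # Toy data: four homozygotes and one double heterozygote at two SNPs.
    sample = [(0, 0), (0, 0), (2, 2), (2, 2), (1, 1)]
    for hap, f in em_haplotype_frequencies(sample).items():
        print(hap, round(f, 4))
```

In the toy data, the double heterozygote is ambiguous between the phasings 00/11 and 01/10. Because the homozygous individuals already support haplotypes 00 and 11, the EM iterations shift nearly all of the weight to that phasing. Enumerating every phasing grows exponentially with the number of heterozygous sites, so a sketch like this is only practical for short SNP windows.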