Meng Zhaoling, Zaykin Dmitri V, Xu Chun-Fang, Wagner Michael, Ehm Margaret G
Bioinformatics Research Center, Campus Box 7566, North Carolina State University, Raleigh, NC 27695-7566, USA.
Am J Hum Genet. 2003 Jul;73(1):115-30. doi: 10.1086/376561. Epub 2003 Jun 5.
The genotyping of closely spaced single-nucleotide polymorphism (SNP) markers frequently yields highly correlated data, owing to extensive linkage disequilibrium (LD) between markers. The extent of LD varies widely across the genome and drives the number of frequent haplotypes observed in small regions. Several studies have illustrated the possibility that LD or haplotype data could be used to select a subset of SNPs that optimizes the information retained in a genomic region while reducing the genotyping effort and simplifying the analysis. We propose a method based on the spectral decomposition of the matrices of pairwise LD between markers, and we select markers on the basis of their contributions to the total genetic variation. We also modify Clayton's "haplotype tagging SNP" selection method, which utilizes haplotype information. For both methods, we propose sliding window-based algorithms that allow the methods to be applied to large chromosomal regions. Our procedures require genotype information on a small number of individuals for an initial set of SNPs; from these data they select an optimal subset of SNPs that can be efficiently genotyped on larger numbers of samples while retaining most of the genetic variation in the samples. We identify suitable parameter combinations for the procedures, and we show that a sample size of 50-100 individuals achieves consistent results in studies of simulated data sets in linkage equilibrium and LD. When applied to experimental data sets, both procedures were similarly effective at reducing the genotyping requirement while maintaining the genetic information content throughout the regions. We also show that haplotype-association results that Hosking et al. obtained near CYP2D6 were almost identical before and after marker selection.
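The spectral-decomposition idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' published algorithm: it assumes genotypes coded as 0/1/2 minor-allele counts, uses the Pearson correlation between SNP columns as a stand-in for the pairwise LD matrix, and picks one SNP per retained eigenvector by largest absolute loading. The function name `select_snps_by_spectral_decomposition` and the `variance_fraction` parameter are hypothetical, and the sliding-window extension is omitted.

```python
import numpy as np


def select_snps_by_spectral_decomposition(genotypes, variance_fraction=0.9):
    """Pick a subset of SNP column indices that captures most of the variation.

    Assumptions (not taken from the paper): `genotypes` is an
    (individuals x SNPs) array of 0/1/2 allele counts, every SNP is
    polymorphic, and 0 < variance_fraction < 1.
    """
    g = np.asarray(genotypes, dtype=float)

    # Pairwise correlation between SNP columns, used here as an LD proxy.
    ld = np.corrcoef(g, rowvar=False)

    # Spectral decomposition of the symmetric LD matrix.
    eigenvalues, eigenvectors = np.linalg.eigh(ld)
    order = np.argsort(eigenvalues)[::-1]          # largest eigenvalue first
    eigenvalues = np.clip(eigenvalues[order], 0.0, None)
    eigenvectors = eigenvectors[:, order]

    # Keep the smallest number of components reaching the target fraction.
    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    n_components = int(np.argmax(cumulative >= variance_fraction)) + 1

    # For each retained component, take the not-yet-chosen SNP that
    # contributes most (largest absolute loading) to that component.
    selected = []
    for k in range(n_components):
        ranked = np.argsort(-np.abs(eigenvectors[:, k]))
        for snp in ranked:
            if snp not in selected:
                selected.append(int(snp))
                break
    return sorted(selected)
```

With two blocks of perfectly correlated SNPs, a sketch like this would be expected to keep one marker per block, which mirrors the goal of reducing genotyping effort while retaining most of the variation. A windowed version could apply the same function to overlapping ranges of columns and merge the selections.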