Ke Xiayi, Durrant Caroline, Morris Andrew P, Hunt Sarah, Bentley David R, Deloukas Panos, Cardon Lon R
Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford, OX3 7BN, UK.
Hum Mol Genet. 2004 Nov 1;13(21):2557-65. doi: 10.1093/hmg/ddh294. Epub 2004 Sep 14.
Haplotype tagging is a means of retaining most of the information in high density marker maps while reducing genotyping requirements. Estimates of the number of tagging SNPs required to cover the human genome have varied widely, ranging from 100,000 to 1,000,000. Tagging has been applied to a number of gene-based datasets but has not been evaluated in contexts reflecting those of genome-wide association studies: large chromosome regions and multiple samples drawn from the same population. We analysed 5000 common markers across a 10 Mb segment of human chromosome 20 in three samples (UK Caucasian, CEPH Caucasian, African American) to evaluate tagging efficiency and consistency. Overall, the results indicate a high degree of efficiency, yielding 3-5-fold savings in Caucasians and 2-3-fold savings in African Americans. These levels varied according to linkage disequilibrium (LD) levels, tagging thresholds and allele frequencies, but in high LD regions they did not vary markedly with marker density. However, a strong positive relationship between marker density and the number of tag SNPs was observed, because increasing marker density yields greater sequence coverage in high LD, thus requiring more tag SNPs to cover a greater fraction of the genome. Encouragingly, whatever the density employed, a high level of robustness was observed between UK and CEPH samples, as most of the haplotype tagging SNPs (htSNPs) selected in one sample were also appropriate as tags in the other.
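To make the idea of a tagging threshold and a fold saving concrete, the following is a minimal sketch and not the selection procedure used in the study, which the abstract does not specify. It assumes phased haplotypes coded as 0/1 alleles and swaps in a simple greedy pairwise r² rule; the function names, the 0.8 threshold and the toy data are all hypothetical.

```python
from itertools import combinations


def r_squared(a, b):
    """Pairwise LD (r^2) between two markers given as 0/1 allele lists."""
    n = len(a)
    pa = sum(a) / n
    pb = sum(b) / n
    if pa in (0.0, 1.0) or pb in (0.0, 1.0):
        return 0.0  # monomorphic marker: r^2 undefined, treat as no LD
    pab = sum(x * y for x, y in zip(a, b)) / n
    d = pab - pa * pb
    return d * d / (pa * (1 - pa) * pb * (1 - pb))


def greedy_tags(markers, threshold=0.8):
    """Greedy tag selection (an assumed stand-in, not the paper's method).

    markers: list of markers, each a list of 0/1 alleles over the same
    phased chromosomes. Returns indices of the chosen tag SNPs.
    """
    m = len(markers)
    # covers[i] holds every marker that marker i can tag, itself included.
    covers = [{i} for i in range(m)]
    for i, j in combinations(range(m), 2):
        if r_squared(markers[i], markers[j]) >= threshold:
            covers[i].add(j)
            covers[j].add(i)
    untagged = set(range(m))
    tags = []
    while untagged:
        # Pick the marker covering the most untagged markers; lowest index wins ties.
        best = max(range(m), key=lambda i: (len(covers[i] & untagged), -i))
        tags.append(best)
        untagged -= covers[best]
    return tags


def fold_savings(n_markers, n_tags):
    """Genotyping saving: markers typed in full versus tag SNPs typed."""
    return n_markers / n_tags


# Toy data (hypothetical): 6 markers on 8 chromosomes, forming 3 LD groups.
demo = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
    [0, 0, 1, 1, 0, 0, 1, 1],
]
tags = greedy_tags(demo)
print(tags, fold_savings(len(demo), len(tags)))
```

In this toy set, three tags stand in for six markers, a 2-fold saving. Lowering the threshold or increasing LD between markers would raise the saving, which mirrors the dependence on LD levels and tagging thresholds reported above.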