Clark Andrew G, Hubisz Melissa J, Bustamante Carlos D, Williamson Scott H, Nielsen Rasmus
Molecular Biology and Genetics and Computational Biology, Cornell University, Ithaca, New York 14853, USA.
Genome Res. 2005 Nov;15(11):1496-502. doi: 10.1101/gr.4107905.
Large-scale SNP genotyping studies rely on an initial assessment of nucleotide variation to identify sites in the DNA sequence that harbor variation among individuals. This "SNP discovery" sample may be quite variable in size and composition, and it has been well established that properties of the SNPs that are found are influenced by the discovery sampling effort. The International HapMap project relied on nearly any piece of information available to identify SNPs (including BAC end sequences, shotgun reads, and differences between public and private sequences) and even made use of chimpanzee data to confirm human sequence differences. In addition, the ascertainment criteria shifted from using only SNPs that had been validated in population samples, to double-hit SNPs, to finally accepting SNPs that were singletons in small discovery samples. In contrast, Perlegen's primary discovery was a resequencing-by-hybridization effort using the 24 people of diverse origin in the Polymorphism Discovery Resource. Here we take these two data sets and contrast two basic summary statistics, heterozygosity and F(ST), as well as the site frequency spectra, for 500-kb windows spanning the genome. The magnitude of disparity between these samples in these measures of variability indicates that population genetic analysis on the raw genotype data is ill-advised. Given the knowledge of the discovery samples, we perform an ascertainment correction and show how the post-correction data are more consistent across these studies. However, discrepancies persist, suggesting that the heterogeneity in the SNP discovery process of the HapMap project resulted in a data set resistant to complete ascertainment correction. Ascertainment bias will likely erode the power of tests of association between SNPs and complex disorders, but the effect will likely be small, and perhaps more importantly, it is unlikely that the bias will introduce false-positive inferences.
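To make the idea of an ascertainment correction concrete, the following is a minimal sketch under a deliberately simple assumption that the abstract does not state: every SNP was discovered by being polymorphic in a random subsample of d chromosomes drawn from the n genotyped chromosomes. Under that assumption, each class of the observed site frequency spectrum is reweighted by the inverse of its discovery probability. The function names (prob_ascertained, correct_sfs, heterozygosity) are hypothetical, and this is not claimed to be the correction procedure used in the paper; the heterogeneous HapMap discovery scheme described above would not fit such a single-depth model.

```python
from math import comb


def prob_ascertained(i, n, d):
    """Probability that a site with i derived alleles among n chromosomes
    is polymorphic in a random subsample of d chromosomes (assumed
    discovery scheme, 2 <= d <= n)."""
    # The site is missed if the subsample is all derived or all ancestral.
    missed = comb(i, d) + comb(n - i, d)
    return 1.0 - missed / comb(n, d)


def correct_sfs(observed, d):
    """Reweight an observed spectrum by inverse discovery probability.

    observed[i - 1] is the number of SNPs with i derived copies,
    for i = 1 .. n - 1. Returns the corrected spectrum as proportions.
    """
    n = len(observed) + 1
    weighted = [
        count / prob_ascertained(i, n, d)
        for i, count in enumerate(observed, start=1)
    ]
    total = sum(weighted)
    return [w / total for w in weighted]


def heterozygosity(sfs):
    """Expected per-SNP heterozygosity implied by a normalized spectrum."""
    n = len(sfs) + 1
    return sum(
        p * 2 * i * (n - i) / (n * (n - 1))
        for i, p in enumerate(sfs, start=1)
    )


if __name__ == "__main__":
    # Toy counts for n = 4 chromosomes, discovery depth d = 2 (made-up data).
    raw = [10, 10, 10]
    corrected = correct_sfs(raw, d=2)
    print(corrected)
    print(heterozygosity(corrected))
```

In this toy case the rare classes are upweighted relative to the intermediate-frequency class, which is the qualitative effect expected when small discovery samples preferentially miss low-frequency SNPs.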