Nickerson D A, Taylor S L, Fullerton S M, Weiss K M, Clark A G, Stengård J H, Salomaa V, Boerwinkle E, Sing C F
Department of Molecular Biotechnology, University of Washington, Seattle, Washington 98195, USA.
Genome Res. 2000 Oct;10(10):1532-45. doi: 10.1101/gr.146900.
A common strategy for genotyping large samples begins with the characterization of human single nucleotide polymorphisms (SNPs) by sequencing candidate regions in a small sample for SNP discovery. This is usually followed by typing, in a larger sample, those sites observed to vary in the smaller sample. We present results from a systematic investigation of variation at the human apolipoprotein E locus (APOE), as well as the evaluation of the two-tiered sampling strategy based on these data. We sequenced 5.5 kb spanning the entire APOE genomic region in a core sample of 72 individuals, including 24 each of African-Americans from Jackson, Mississippi; European-Americans from Rochester, Minnesota; and Europeans from North Karelia, Finland. This sequence survey detected 21 SNPs and 1 multiallelic indel, 14 of which had not been previously reported. Alleles varied in relative frequency among the populations, and 10 sites were polymorphic in only a single population sample. Oligonucleotide ligation assays (OLA) were developed for 20 of these sites (omitting the indel and a closely linked SNP). These were then scored in 2179 individuals sampled from the same three populations (n = 843, 884, and 452, respectively). Relative allele frequencies were generally consistent with estimates from the core sample, although variation was found in some populations in the larger sample at SNPs that were monomorphic in the corresponding smaller core sample. Site variation in the larger samples showed no systematic deviation from Hardy-Weinberg expectation. The large OLA sample clearly showed that variation in many, but not all, of the OLA-typed SNPs is significantly correlated with the classical protein-coding variants, implying that there may be important substructure within the classical epsilon 2, epsilon 3, and epsilon 4 alleles.
Comparison of the levels and patterns of polymorphism in the core samples with those estimated for the OLA-typed samples shows how nucleotide diversity is underestimated when only a subset of sites is typed and underscores the importance of adequate population sampling at the polymorphism discovery stage. [The sequence data described in this paper have been submitted to the GenBank data library under accession no. AF261279.]
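The point about sample size at the discovery stage can be illustrated with a short calculation. The sketch below is not taken from the paper. It assumes that sampled chromosomes are independent random draws from a population in which the allele has frequency `freq`, so the chance of seeing the allele at least once among 2n chromosomes is 1 - (1 - freq)^(2n). The function name `detection_probability` and the example frequencies are illustrative choices. The sample sizes (24 individuals per core population, 452 in the smallest OLA-typed population) come from the abstract.

```python
def detection_probability(freq: float, n_individuals: int, ploidy: int = 2) -> float:
    """Probability that an allele with population frequency `freq` is observed
    at least once in a sample of `n_individuals`.

    Assumption (not from the paper): the ploidy * n_individuals sampled
    chromosomes are independent draws from the population.
    """
    if not 0.0 <= freq <= 1.0:
        raise ValueError("freq must lie in [0, 1]")
    if n_individuals < 0:
        raise ValueError("n_individuals must be non-negative")
    n_chromosomes = ploidy * n_individuals
    # Complement of the probability that every sampled chromosome lacks the allele.
    return 1.0 - (1.0 - freq) ** n_chromosomes


if __name__ == "__main__":
    core_n = 24    # individuals per population in the core sequencing sample
    large_n = 452  # smallest of the three OLA-typed population samples
    # Illustrative allele frequencies, chosen for this sketch only.
    for freq in (0.001, 0.01, 0.05, 0.10):
        p_core = detection_probability(freq, core_n)
        p_large = detection_probability(freq, large_n)
        print(f"freq={freq:<6} core(n={core_n}): {p_core:.3f}   large(n={large_n}): {p_large:.3f}")
```

Under these assumptions, an allele at 1% frequency would be seen in a 24-person core sample with a probability of roughly 0.38, while it would almost certainly appear among 452 individuals. This is consistent with the observation that some SNPs appeared monomorphic in a core sample yet varied in the corresponding larger sample, although the calculation ignores population structure and genotyping error.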