Paschou Peristera, Drineas Petros, Lewis Jamey, Nievergelt Caroline M, Nickerson Deborah A, Smith Joshua D, Ridker Paul M, Chasman Daniel I, Krauss Ronald M, Ziv Elad
Department of Molecular Biology and Genetics, Democritus University of Thrace, Alexandroupoli, Greece.
PLoS Genet. 2008 Jul 4;4(7):e1000114. doi: 10.1371/journal.pgen.1000114.
Genetic structure in the European American population reflects waves of migration and recent gene flow among different populations. This complex structure can introduce bias in genetic association studies. Using Principal Components Analysis (PCA), we analyze the structure of two independent European American datasets (1,521 individuals, 307,315 autosomal SNPs). Individual variation lies across a continuum, with some individuals showing high degrees of admixture with non-European populations, as demonstrated through joint analysis with HapMap data. The CEPH Europeans represent only a small fraction of the variation encountered in the larger European American datasets we studied. We interpret the first eigenvector of these data as correlated with ancestry, and we apply an algorithm that we have previously described to select PCA-informative markers (PCAIMs) that can reproduce this structure. Importantly, we develop a novel method that removes redundancy from the selected SNP panels and show that it effectively removes correlated markers, thus increasing genotyping savings. Only 150-200 PCAIMs suffice to accurately predict fine structure in European American datasets, as identified by PCA. Simulating association studies, we couple our method with a PCA-based stratification correction tool and demonstrate that a small number of PCAIMs can efficiently remove false correlations with almost no loss in power. The structure-informative SNPs that we propose are an important resource for genetic association studies of European Americans. Furthermore, our redundancy removal algorithm can be applied to sets of ancestry-informative markers selected with any method in order to select the most uncorrelated SNPs, significantly decreasing genotyping costs.
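The marker-selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions. SNPs are scored by the squared loadings of the top-k principal components, which is one common way to rank PCA-informative markers. A simple greedy correlation filter stands in for the redundancy removal method, which the abstract does not specify. The function names (`simulate_genotypes`, `pca_scores`, `select_markers`), the correlation threshold of 0.8, and the simulated two-population data are all hypothetical.

```python
import numpy as np


def simulate_genotypes(n_per_pop=100, n_snps=200, n_informative=20, seed=0):
    """Toy data: two populations whose allele frequencies differ at a few SNPs."""
    rng = np.random.default_rng(seed)
    freq_a = rng.uniform(0.2, 0.8, n_snps)
    freq_b = freq_a.copy()
    # Assumption: the first n_informative SNPs carry the population signal.
    freq_a[:n_informative] = 0.2
    freq_b[:n_informative] = 0.8
    pop_a = rng.binomial(2, freq_a, size=(n_per_pop, n_snps))
    pop_b = rng.binomial(2, freq_b, size=(n_per_pop, n_snps))
    return np.vstack([pop_a, pop_b])  # rows: individuals, columns: SNPs (0/1/2)


def pca_scores(genotypes, k):
    """Score each SNP by its squared loadings on the top-k principal components."""
    centered = genotypes - genotypes.mean(axis=0)  # center each SNP column
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return (vt[:k] ** 2).sum(axis=0)


def select_markers(genotypes, k, n_markers, max_abs_corr=0.8):
    """Pick high-scoring SNPs, skipping any that correlate strongly with a pick.

    The greedy filter is a stand-in for the paper's redundancy removal step.
    The full SNP-by-SNP correlation matrix is only practical for toy sizes.
    """
    scores = pca_scores(genotypes, k)
    order = np.argsort(scores)[::-1]  # best-scoring SNPs first
    corr = np.corrcoef(genotypes, rowvar=False)
    chosen = []
    for j in order:
        if all(abs(corr[j, c]) < max_abs_corr for c in chosen):
            chosen.append(int(j))
        if len(chosen) == n_markers:
            break
    return chosen


if __name__ == "__main__":
    genotypes = simulate_genotypes()
    panel = select_markers(genotypes, k=1, n_markers=10)
    print("selected SNP indices:", panel)
```

With real data the genotype matrix would be far wider (hundreds of thousands of SNPs), so the correlation check would need to be restricted to candidate markers rather than computed for all pairs.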