Department of Psychiatry, Washington University School of Medicine, St. Louis, Missouri 63110, USA.
Genet Epidemiol. 2009;33 Suppl 1(Suppl 1):S88-92. doi: 10.1002/gepi.20478.
Although the importance of selecting cases and controls from the same population has been recognized for decades, the recent advent of genome-wide association studies has heightened awareness of this issue. Because these studies typically deal with large samples, small differences in allele frequencies between cases and controls can easily reach statistical significance. When, unbeknownst to a researcher, cases and controls have different substructures, the number of false-positive findings is inflated. Three purely statistical approaches to assessing the ancestral comparability of case and control samples have recently been developed: genomic control, structured association, and multivariate reduction analyses. The widespread use of high-throughput technology has allowed the quick and accurate genotyping of the large number of markers required by these methods. Group 13 dealt with four population stratification issues: single-nucleotide polymorphism marker selection, association testing, nonstandard methods, and linkage disequilibrium calculations in stratified or mixed ethnicity samples. We demonstrated that there are continuous axes of ethnic variation in both data sets of Genetic Analysis Workshop 16. Furthermore, ignoring this structure created P-value inflation for a variety of phenotypes. Principal components (or multidimensional scaling coordinates), included as covariates in a logistic regression, can control this inflation. One can weight for local ancestry estimation and allow the use of related individuals. Problems arise in the presence of extremely high association or unusually strong linkage disequilibrium (e.g., in chromosomal inversions). Our group also reported a method for performing an association test that controls for substructure when genome-wide markers are not available to explicitly compute stratification.
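As an illustration of the first of the three approaches named above, the following is a minimal sketch of genomic control in one common form: the inflation factor is estimated as the median of the observed 1-df chi-square association statistics divided by the median of the null chi-square distribution, and each statistic is then divided by that factor. The function names and the example statistics are hypothetical and are not taken from the Group 13 contributions, which may have used different estimators or software.

```python
import math
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.454936


def inflation_factor(chi2_stats):
    """Estimate lambda as median(observed statistics) / median(null chi-square, 1 df)."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


def genomic_control(chi2_stats):
    """Return (lambda, corrected statistics, corrected P-values).

    Assumption for this sketch: lambda is floored at 1 so that
    statistics are never inflated by the correction.
    """
    lam = max(inflation_factor(chi2_stats), 1.0)
    corrected = [x / lam for x in chi2_stats]
    # Upper-tail P-value of a 1-df chi-square: P(X > x) = erfc(sqrt(x / 2)).
    pvals = [math.erfc(math.sqrt(x / 2.0)) for x in corrected]
    return lam, corrected, pvals


if __name__ == "__main__":
    # Hypothetical per-marker association statistics, for illustration only.
    example = [0.3, 0.9, 1.2, 2.5, 6.0]
    lam, corrected, pvals = genomic_control(example)
    print(f"lambda = {lam:.3f}")
    for x, c, p in zip(example, corrected, pvals):
        print(f"chi2 = {x:.2f}  corrected = {c:.2f}  P = {p:.4f}")
```

A lambda well above 1 suggests that the test statistics are inflated, for example by unrecognized substructure. Dividing every statistic by a single factor assumes the inflation is roughly uniform across markers, which is one reason covariate-based adjustments are often used instead.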