Microsoft Research, Redmond, Washington, United States of America.
PLoS One. 2011;6(7):e21591. doi: 10.1371/journal.pone.0021591. Epub 2011 Jul 12.
Understanding the role of genetic variation in human diseases remains an important open problem in genomics. An important component of such variation consists of variations at single sites in DNA, known as single nucleotide polymorphisms (SNPs). Typically, the problem of associating particular SNPs with phenotypes has been confounded by hidden factors such as population structure, family structure, or cryptic relatedness in the sample of individuals being analyzed. Such confounding factors lead to a large number of spurious associations and missed associations. Various statistical methods have been proposed to account for such confounding factors, such as linear mixed-effect models (LMMs) or methods that adjust data based on a principal components analysis (PCA), but these methods either suffer from low power or cease to be tractable for larger numbers of individuals in the sample. Here we present a statistical model for conducting genome-wide association studies (GWAS) that accounts for such confounding factors. The runtime of our method scales quadratically in the number of individuals being studied, with only a modest loss in statistical power compared to LMM-based and PCA-based methods when tested on synthetic data generated from a generalized LMM. Applying our method to both real and synthetic human genotype/phenotype data, we demonstrate that our model corrects for confounding factors while requiring significantly less runtime than LMMs. We have implemented methods for fitting these models, which are available at http://www.microsoft.com/science.
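To make the confounding problem concrete, the following is a minimal sketch, not the model presented in this paper, of how population structure can inflate association statistics and how a PCA-based adjustment of the kind mentioned above can reduce that inflation. Everything in it is an illustrative assumption: the two-population simulation, the function names (simulate, assoc_chi2, inflation, residualize_on_pcs), and all parameter values are hypothetical, and no SNP is causal by construction.

```python
import numpy as np

def simulate(n_per_pop=200, n_snps=2000, seed=0):
    """Toy data (assumed setup): two populations, no causal SNPs."""
    rng = np.random.default_rng(seed)
    # Shared ancestral allele frequencies, perturbed separately per population.
    base = rng.uniform(0.1, 0.9, size=n_snps)
    freqs = np.clip(base + rng.normal(0, 0.1, size=(2, n_snps)), 0.05, 0.95)
    pop = np.repeat([0, 1], n_per_pop)
    # Genotypes are minor-allele counts in {0, 1, 2}.
    genotypes = rng.binomial(2, freqs[pop]).astype(float)
    # The phenotype depends only on population membership (the confounder).
    phenotype = 1.0 * pop + rng.normal(0, 1, size=pop.size)
    return genotypes, phenotype

def assoc_chi2(G, y):
    """Per-SNP statistic n * r^2, approximately chi-square (1 df) under the null."""
    Gc = G - G.mean(axis=0)
    yc = y - y.mean()
    r = (Gc.T @ yc) / (np.linalg.norm(Gc, axis=0) * np.linalg.norm(yc))
    return G.shape[0] * r**2

def inflation(chi2):
    """Genomic inflation factor: observed median over the chi-square(1) median."""
    return np.median(chi2) / 0.4549

def residualize_on_pcs(G, y, k=1):
    """Project the top-k genotype principal components out of G and y."""
    Gc = G - G.mean(axis=0)
    yc = y - y.mean()
    U, _, _ = np.linalg.svd(Gc, full_matrices=False)
    P = U[:, :k]  # orthonormal scores of the top-k components
    return Gc - P @ (P.T @ Gc), yc - P @ (P.T @ yc)

if __name__ == "__main__":
    G, y = simulate()
    print("inflation, unadjusted:", inflation(assoc_chi2(G, y)))
    G_adj, y_adj = residualize_on_pcs(G, y, k=1)
    print("inflation, PC-adjusted:", inflation(assoc_chi2(G_adj, y_adj)))
```

An inflation factor well above 1 indicates many spurious associations; a value near 1 after adjustment suggests the structure has been accounted for in this toy setting. The sketch says nothing about statistical power or about the runtime behavior of LMMs, which are the trade-offs addressed in the paper.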