School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA.
Bioinformatics. 2010 Jun 15;26(12):i208-16. doi: 10.1093/bioinformatics/btq191.
Population heterogeneity arising from the admixing of different founder populations can produce spurious associations in genome-wide association studies that are linked to the population structure rather than the phenotype. Since samples from the same population generally co-evolve, different populations may or may not share the same genetic underpinnings for a seemingly common phenotype. Our goal is to develop a unified framework for detecting causal genetic markers through a joint association analysis of multiple populations.
Based on a multi-task regression principle, we present a multi-population group lasso algorithm using L1/L2-regularized regression for joint association analysis of multiple populations that are stratified either via population survey or computational estimation. Our algorithm combines information from genetic markers across populations to identify causal markers. It also implicitly accounts for correlations between the genetic markers, thus enabling better control over false positive rates. Joint analysis across populations enables the detection of weak associations common to all populations with greater power than a separate analysis of each population. At the same time, the regression-based framework allows causal alleles that are unique to a subset of the populations to be correctly identified. We demonstrate the effectiveness of our method on HapMap-simulated and lactase persistence datasets, where we significantly outperform state-of-the-art methods, with greater power for detecting weak associations and fewer spurious associations.
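To make the L1/L2 penalty concrete, the following is a minimal sketch rather than the authors' implementation. It assumes one squared-error regression per population, each scaled by its own sample size, with the coefficients of a marker across all populations forming one group that is penalized by its L2 norm. The solver is plain proximal gradient descent, chosen here for brevity; the abstract does not say which optimizer the paper uses. The function names, the parameter `lam` and the simulated data are all hypothetical.

```python
import numpy as np


def group_soft_threshold(B, tau):
    """Shrink each row of B toward zero by tau in L2 norm.

    This is the proximal operator of tau * sum_j ||B[j, :]||_2, so a
    marker (row) is either kept in all populations or zeroed in all.
    """
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return B * scale


def multi_population_group_lasso(Xs, ys, lam, n_iter=500):
    """Proximal-gradient sketch of an L1/L2-regularized joint regression.

    Xs: list of (n_k, p) genotype matrices, one per population.
    ys: list of (n_k,) phenotype vectors, one per population.
    Returns a (p, K) coefficient matrix: rows are markers, columns are
    populations.
    """
    p, K = Xs[0].shape[1], len(Xs)
    B = np.zeros((p, K))
    # The smooth part separates by population, so its Lipschitz constant
    # is the largest per-population value of sigma_max(X_k)^2 / n_k.
    lipschitz = max(np.linalg.norm(X, 2) ** 2 / X.shape[0] for X in Xs)
    step = 1.0 / lipschitz
    for _ in range(n_iter):
        # Gradient of the per-population mean squared-error terms.
        G = np.column_stack([
            X.T @ (X @ B[:, k] - y) / X.shape[0]
            for k, (X, y) in enumerate(zip(Xs, ys))
        ])
        B = group_soft_threshold(B - step * G, step * lam)
    return B


def simulate_populations(seed=0, n=200, p=20, causal=3, effects=(1.0, 0.8)):
    """Toy data (an assumption, not HapMap): one shared causal marker."""
    rng = np.random.default_rng(seed)
    Xs, ys = [], []
    for beta in effects:
        # Genotypes coded 0/1/2 with allele frequency 0.3, then centered.
        X = rng.binomial(2, 0.3, size=(n, p)).astype(float)
        X -= X.mean(axis=0)
        y = beta * X[:, causal] + rng.normal(0.0, 0.5, size=n)
        Xs.append(X)
        ys.append(y - y.mean())
    return Xs, ys
```

Calling `multi_population_group_lasso(*simulate_populations(), lam=0.1)` returns a coefficient matrix whose row norms can be ranked to prioritize markers. Because the penalty acts on whole rows, evidence for a marker is pooled across populations, while the individual entries within a row remain free to differ in size between populations.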
Software will be available at http://www.sailing.cs.cmu.edu/.