Kim Seyoung, Xing Eric P
School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania.
J Comput Biol. 2014 Apr;21(4):345-60. doi: 10.1089/cmb.2009.0224. Epub 2011 May 6.
A genome-wide association study examines a large number of single-nucleotide polymorphisms (SNPs) to identify those that are significantly associated with a given phenotype while keeping the false positive rate low. Although haplotype-based association methods have been proposed to accommodate correlation across nearby SNPs that are in linkage disequilibrium, none of these methods directly incorporates structural information such as recombination events along the chromosome. In this paper, we propose a new approach for association mapping, called stochastic block lasso, that exploits prior knowledge of the linkage disequilibrium structure in the genome, such as recombination rates and distances between adjacent SNPs, in order to increase the power to detect true associations while reducing false positives. Following a typical linear regression framework with the genotypes as inputs and the phenotype as output, our proposed method employs a sparsity-enforcing Laplacian prior for the regression coefficients, augmented by a first-order Markov process along the sequence of SNPs that incorporates the prior information on the linkage disequilibrium structure. The Markov-chain prior models the structural dependencies between pairs of adjacent SNPs and allows us to search for associated SNPs in a coupled manner, combining strength from multiple nearby SNPs. Our results on HapMap-simulated datasets and mouse datasets show that there is a significant advantage in incorporating prior knowledge of the linkage disequilibrium structure for marker identification in whole-genome association.
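The regression framework described above can be illustrated with a minimal sketch. It covers only the sparsity-enforcing part: the MAP estimate under a Laplacian prior on the coefficients is the ordinary lasso, solved here by proximal gradient descent (ISTA). It is not the stochastic block lasso itself, since the first-order Markov prior over linkage disequilibrium blocks is omitted. The function names, the simulated genotype coding (0/1/2 minor-allele counts), the causal SNP positions, and the penalty value `lam` are all assumptions made for illustration and do not come from the paper.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise shrinkage toward zero)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize (1/2n) * ||y - X b||^2 + lam * ||b||_1 by ISTA.

    This is the MAP estimate under a Laplacian prior on b; the Markov-chain
    prior of the stochastic block lasso is NOT modeled here.
    """
    n, p = X.shape
    # Lipschitz constant of the gradient of the smooth (least-squares) term.
    lipschitz = np.linalg.norm(X, 2) ** 2 / n
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = soft_threshold(beta - grad / lipschitz, lam / lipschitz)
    return beta

def simulate(n=200, p=50, causal=(10, 30), seed=0):
    """Hypothetical toy data: independent SNPs coded as 0/1/2 allele counts."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 3, size=(n, p)).astype(float)
    X -= X.mean(axis=0)                      # center each SNP column
    beta_true = np.zeros(p)
    beta_true[list(causal)] = 1.0            # assumed effect size
    y = X @ beta_true + 0.5 * rng.standard_normal(n)
    y -= y.mean()                            # center the phenotype
    return X, y, beta_true

X, y, beta_true = simulate()
beta_hat = lasso_ista(X, y, lam=0.2)
selected = np.flatnonzero(beta_hat)          # SNP indices with nonzero effect
print("selected SNP indices:", selected)
```

In this toy setting the SNPs are simulated independently, so there is no linkage disequilibrium for a structured prior to exploit; the sketch only shows how the Laplacian prior drives most coefficients exactly to zero.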