Department of Statistics, Pennsylvania State University, State College, PA 16802, USA.
Bioinformatics. 2011 Feb 15;27(4):516-23. doi: 10.1093/bioinformatics/btq688. Epub 2010 Dec 14.
Despite their success in identifying genes that affect complex diseases or traits, current genome-wide association studies (GWASs) based on single-SNP analysis are too simple to give a comprehensive picture of the genetic architecture of phenotypes. A simultaneous analysis of a large number of SNPs, although statistically challenging, especially with a small number of samples, is crucial for genetic modeling.
We propose a two-stage procedure for multi-SNP modeling and analysis in GWASs: first, a supervised principal component analysis produces a 'preconditioned' response variable; second, a Bayesian lasso selects a subset of significant SNPs. The Bayesian lasso is implemented as a hierarchical model in which scale mixtures of normals serve as prior distributions for the genetic effects, with exponential priors on their variances, and is fitted by a Markov chain Monte Carlo (MCMC) algorithm. Our approach obviates the choice of the lasso parameter by imposing a diffuse hyperprior on it and estimating it along with the other parameters. It is particularly powerful for selecting the most relevant SNPs in GWASs, where the number of predictors exceeds the number of observations.
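The hierarchical model described above can be illustrated with a minimal Gibbs sampler. This is only a sketch of a standard Bayesian lasso formulation (normal priors on the effects with exponentially distributed variances, and a gamma hyperprior on the squared lasso parameter), not the authors' code. The function name `bayesian_lasso_gibbs`, the hyperparameters `r` and `delta`, and the iteration counts are assumptions made for illustration, and the supervised principal component preconditioning step is not shown.

```python
import numpy as np


def bayesian_lasso_gibbs(X, y, n_iter=2000, burn_in=500, r=1.0, delta=0.1, seed=0):
    """Gibbs sampler for a Bayesian lasso; returns post-burn-in draws of beta.

    r and delta are the (assumed) shape and rate of a gamma hyperprior on
    lambda^2, so the lasso parameter is sampled rather than fixed.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    # Centre predictors and response so the intercept drops out of the model.
    X = X - X.mean(axis=0)
    y = y - y.mean()
    XtX, Xty = X.T @ X, X.T @ y

    sigma2, lam2 = 1.0, 1.0
    inv_tau2 = np.ones(p)  # 1 / tau_j^2, the inverse prior variances
    draws = np.empty((n_iter - burn_in, p))

    for it in range(n_iter):
        # beta | rest ~ N(A^{-1} X'y, sigma2 * A^{-1}), A = X'X + diag(1/tau_j^2)
        A = XtX + np.diag(inv_tau2)
        L = np.linalg.cholesky(A)  # A = L L'
        mean = np.linalg.solve(A, Xty)
        beta = mean + np.sqrt(sigma2) * np.linalg.solve(L.T, rng.standard_normal(p))

        # sigma2 | rest ~ Inverse-Gamma(shape, rate)
        resid = y - X @ beta
        shape = 0.5 * (n - 1 + p)
        rate = 0.5 * (resid @ resid + beta @ (inv_tau2 * beta))
        sigma2 = rate / rng.gamma(shape)

        # 1/tau_j^2 | rest ~ Inverse-Gaussian(mean, shape = lambda^2)
        ig_mean = np.sqrt(lam2 * sigma2 / np.maximum(beta**2, 1e-12))
        inv_tau2 = np.maximum(rng.wald(ig_mean, lam2), 1e-10)

        # lambda^2 | rest ~ Gamma(p + r, rate = sum(tau_j^2) / 2 + delta)
        lam2 = rng.gamma(p + r, 1.0 / (0.5 * np.sum(1.0 / inv_tau2) + delta))

        if it >= burn_in:
            draws[it - burn_in] = beta
    return draws
```

Given a genotype matrix `X` (samples by SNPs) and a response `y`, `draws = bayesian_lasso_gibbs(X, y)` returns posterior samples of the SNP effects. One plausible selection rule, again an assumption rather than the paper's stated criterion, is to keep SNPs whose 95% credible interval (for example from `np.percentile(draws, [2.5, 97.5], axis=0)`) excludes zero.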
The new approach was examined through a simulation study. Applying it to a real dataset from the Framingham Heart Study, we detected several significant genes associated with body mass index (BMI). Our findings support previous results on BMI-related SNPs and provide new insights into the genetic control of this trait.
The computer code for the approach is available at the Penn State Center for Statistical Genetics website, http://statgen.psu.edu.