Department of Radiology, University of California, San Diego, La Jolla, CA, USA.
Department of Radiation Sciences, Umeå University, Umeå, Sweden.
Bioinformatics. 2019 Jan 1;35(1):1-11. doi: 10.1093/bioinformatics/bty472.
Multiple-marker analysis of genome-wide association study (GWAS) data has gained ample attention in recent years. However, the ultra-high dimensionality of GWAS data makes such analysis challenging. Frequently used penalized regression methods often yield a large number of false positives, whereas Bayesian methods are computationally very expensive. To address both issues simultaneously, we consider the novel approach of using non-local priors in an iterative variable selection framework.
We develop a variable selection method named iterative non-local prior based selection for GWAS, or GWASinlps. Within an iterative variable selection framework, it combines the computational efficiency of the screen-and-select approach, based on some association learning, with the parsimonious uncertainty quantification provided by non-local priors. The hallmark of our method is the introduction of a 'structured screen-and-select' strategy that uses hierarchical screening, based not only on response-predictor associations but also on response-response associations, and concatenates variable selection within that hierarchy. Extensive simulation studies with single nucleotide polymorphisms having realistic linkage disequilibrium structures demonstrate the advantages of our computationally efficient method over several frequentist and Bayesian variable selection methods in terms of true positive rate, false discovery rate, mean squared error and effect size estimation error. Further, we provide an empirical power analysis useful for study design. Finally, we apply the method to real GWAS data with human height as the phenotype.
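The general idea of iterating a screen-and-select step on residuals can be sketched as follows. This is a minimal illustration under stated assumptions, not the GWASinlps implementation: screening uses a single level (absolute correlation with the current residual) rather than the hierarchical screening described above, and selection uses exhaustive BIC subset search as a plainly swapped-in stand-in for non-local prior based selection. All function names, the screening size `k`, the stopping rule, and the simulated data (independent SNPs, no linkage disequilibrium) are hypothetical choices made for the example.

```python
import itertools
import numpy as np


def fit_residuals(y, X, cols):
    """Residuals of y after least squares on an intercept plus columns `cols`."""
    design = np.column_stack([np.ones(len(y)), X[:, cols]])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ beta


def bic(y, X, cols):
    """Gaussian BIC of the linear model using columns `cols` (stand-in criterion)."""
    n = len(y)
    rss = float(np.sum(fit_residuals(y, X, cols) ** 2))
    return n * np.log(rss / n) + (len(cols) + 1) * np.log(n)


def screen(resid, X, candidates, k):
    """Keep the k candidates with the largest absolute correlation with resid."""
    scores = [abs(np.corrcoef(resid, X[:, j])[0, 1]) for j in candidates]
    order = np.argsort(scores)[::-1][:k]
    return [candidates[i] for i in order]


def select(resid, X, screened):
    """Best subset of the screened set by BIC; empty list if the null model wins."""
    best, best_score = [], bic(resid, X, [])
    for size in range(1, len(screened) + 1):
        for subset in itertools.combinations(screened, size):
            score = bic(resid, X, list(subset))
            if score < best_score:
                best, best_score = list(subset), score
    return best


def iterative_screen_and_select(y, X, k=3, max_iter=10):
    """Alternate screening and selection, refitting residuals after each round."""
    candidates = list(range(X.shape[1]))
    selected = []
    resid = fit_residuals(y, X, selected)  # starts as y minus its mean
    for _ in range(max_iter):
        if not candidates:
            break
        screened = screen(resid, X, candidates, k)
        chosen = select(resid, X, screened)
        if not chosen:  # assumed stopping rule: stop at the first empty selection
            break
        selected.extend(chosen)
        # Screened predictors leave the candidate pool, whether chosen or not.
        candidates = [j for j in candidates if j not in screened]
        resid = fit_residuals(y, X, selected)
    return sorted(selected)


# Simulated example: 200 samples, 50 independent SNPs coded 0/1/2.
rng = np.random.default_rng(0)
X = rng.binomial(2, 0.3, size=(200, 50)).astype(float)
true_effects = {3: 2.0, 17: -1.5, 40: 1.0}  # hypothetical causal SNPs
y = sum(b * X[:, j] for j, b in true_effects.items()) + rng.normal(0, 0.5, 200)

selected = iterative_screen_and_select(y, X)
print(selected)
```

Because each round only searches subsets of `k` screened predictors, the cost per iteration stays small even when the number of predictors is large. The BIC stand-in may admit some null predictors in later rounds, so the sketch should not be read as reproducing the false discovery behaviour reported for the method.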
An R package implementing the GWASinlps method is available at https://cran.r-project.org/web/packages/GWASinlps/index.html.
Supplementary data are available at Bioinformatics online.