Zhu Xiang, Stephens Matthew
University of Chicago.
Ann Appl Stat. 2017;11(3):1561-1592. doi: 10.1214/17-aoas1046. Epub 2017 Oct 5.
Bayesian methods for large-scale multiple regression provide attractive approaches to the analysis of genome-wide association studies (GWAS). For example, they can estimate heritability of complex traits, allowing for both polygenic and sparse models; and by incorporating external genomic data into the priors, they can increase power and yield new biological insights. However, these methods require access to individual genotypes and phenotypes, which are often not easily available. Here we provide a framework for performing these analyses without individual-level data. Specifically, we introduce a "Regression with Summary Statistics" (RSS) likelihood, which relates the multiple regression coefficients to univariate regression results that are often easily available. The RSS likelihood requires estimates of correlations among covariates (SNPs), which can also be obtained from public databases. We perform Bayesian multiple regression analysis by combining the RSS likelihood with previously proposed prior distributions, sampling posteriors by Markov chain Monte Carlo. In a wide range of simulations, RSS performs similarly to analyses using the individual-level data, both for estimating heritability and for detecting associations. We apply RSS to a GWAS of human height that contains 253,288 individuals typed at 1.06 million SNPs, for which analyses of individual-level data are practically impossible. Estimates of heritability (52%) are consistent with, but more precise than, previous results using subsets of these data. We also identify many previously unreported loci that show evidence for association with height in our analyses. Software is available at https://github.com/stephenslab/rss.
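The abstract does not write out the RSS likelihood. As an illustration only, the sketch below assumes the multivariate normal form usually given for it, in which the vector of univariate estimates is distributed as N(S R S^{-1} beta, S R S), with S the diagonal matrix of univariate standard errors and R the SNP correlation (LD) matrix. The function name `rss_log_likelihood` and the toy numbers are hypothetical and are not taken from the authors' software.

```python
import numpy as np


def rss_log_likelihood(beta, betahat, se, R):
    """Log-density of betahat under N(S R S^{-1} beta, S R S), S = diag(se).

    beta    : multiple-regression coefficients (length p)
    betahat : univariate (single-SNP) effect estimates (length p)
    se      : standard errors of betahat (length p, all positive)
    R       : p x p SNP correlation (LD) matrix, assumed positive definite
    """
    beta = np.asarray(beta, dtype=float)
    betahat = np.asarray(betahat, dtype=float)
    se = np.asarray(se, dtype=float)
    R = np.asarray(R, dtype=float)

    mean = se * (R @ (beta / se))   # S R S^{-1} beta
    cov = R * np.outer(se, se)      # S R S
    resid = betahat - mean

    sign, logdet = np.linalg.slogdet(cov)
    if sign <= 0:
        raise ValueError("S R S must be positive definite")

    quad = resid @ np.linalg.solve(cov, resid)
    return -0.5 * (betahat.size * np.log(2.0 * np.pi) + logdet + quad)


# Toy data (made up): two SNPs in LD with correlation 0.5.
R = np.array([[1.0, 0.5], [0.5, 1.0]])
se = np.array([0.1, 0.2])
betahat = np.array([0.3, 0.3])

# Only SNP 1 has a nonzero multiple-regression effect, yet both SNPs
# show the same univariate estimate because of their correlation.
ll_causal = rss_log_likelihood([0.3, 0.0], betahat, se, R)
ll_null = rss_log_likelihood([0.0, 0.0], betahat, se, R)
print(ll_causal > ll_null)
```

In this toy setting the mean S R S^{-1} beta for beta = (0.3, 0) equals (0.3, 0.3), which shows how correlation among SNPs spreads a single multiple-regression effect across several univariate estimates. A full analysis would combine this likelihood with a prior on beta and sample the posterior by MCMC, which the sketch does not attempt.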