Department of Statistics, Virginia Polytechnic Institute and State University, Blacksburg, VA, U.S.A.
Stat Med. 2012 Jul 10;31(15):1633-51. doi: 10.1002/sim.4493. Epub 2012 Mar 22.
Many statistical methods for microarray data analysis consider one gene at a time and may therefore miss subtle changes at the single-gene level. This limitation may be overcome by considering a set of genes simultaneously, where the gene sets are derived from prior biological knowledge. Limited work has been carried out in the regression setting to study the effects of clinical covariates and of the expression levels of genes in a pathway on either a continuous or a binary clinical outcome. Hence, we propose a Bayesian approach for identifying pathways related to both types of outcomes. We compare our Bayesian approach with a likelihood-based approach that was developed by relating a least squares kernel machine for the nonparametric pathway effect to restricted maximum likelihood for variance components. Unlike the likelihood-based approach, the Bayesian approach allows us to directly estimate all parameters and pathway effects. It can incorporate prior knowledge into the Bayesian hierarchical model formulation, and it makes inference from posterior samples without relying on asymptotic theory. We consider several kernels (Gaussian, polynomial, and neural network kernels) to characterize gene expression effects in a pathway on clinical outcomes. Our simulation results suggest that the Bayesian approach has more accurate coverage probability than the likelihood-based approach, especially when the sample size is small compared with the number of genes being studied in a pathway. We demonstrate the usefulness of our approach through an application to a type II diabetes mellitus data set. Our approach can also be applied to other settings where a large number of strongly correlated predictors are present.
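The three kernel families named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact parameterizations, so the functional forms below are commonly used ones and are assumptions here, as are the function names, the parameter names (rho, degree, offset, scale), and the synthetic expression matrix Z. Each function maps an n-by-p matrix of gene expression values (n subjects, p genes in a pathway) to an n-by-n similarity matrix between subjects.

```python
import numpy as np


def gaussian_kernel(Z, rho=1.0):
    """Gaussian kernel: K[i, j] = exp(-||z_i - z_j||^2 / rho) (assumed form)."""
    sq = np.sum(Z ** 2, axis=1)
    # Squared Euclidean distances between all pairs of rows.
    d2 = sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T)
    d2 = np.maximum(d2, 0.0)  # guard against tiny negative rounding errors
    return np.exp(-d2 / rho)


def polynomial_kernel(Z, degree=2, offset=1.0):
    """Polynomial kernel: K[i, j] = (z_i . z_j + offset)^degree (assumed form)."""
    return (Z @ Z.T + offset) ** degree


def neural_network_kernel(Z, scale=1.0, offset=0.0):
    """Neural network (sigmoid) kernel: K[i, j] = tanh(scale * z_i . z_j + offset)
    (assumed form)."""
    return np.tanh(scale * (Z @ Z.T) + offset)


# Synthetic example: 5 subjects, 20 genes in one hypothetical pathway.
rng = np.random.default_rng(0)
Z = rng.standard_normal((5, 20))

K_gauss = gaussian_kernel(Z, rho=20.0)
K_poly = polynomial_kernel(Z, degree=2, offset=1.0)
K_nn = neural_network_kernel(Z, scale=1.0 / Z.shape[1])
```

Because each matrix is n-by-n regardless of the number of genes, a kernel-based model of this kind can in principle handle pathways in which the number of genes exceeds the number of subjects, which is the small-sample setting the abstract highlights.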