Scharpf Robert B, Tjelmeland Håkon, Parmigiani Giovanni, Nobel Andrew B
Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205.
J Am Stat Assoc. 2009;104(488):1295-1310. doi: 10.1198/jasa.2009.ap07611.
In this paper we define a hierarchical Bayesian model for microarray expression data collected from several studies and use it to identify genes that show differential expression between two conditions. Key features include shrinkage across both genes and studies, and flexible modeling that allows for interactions between platforms and the estimated effect, as well as concordant and discordant differential expression across studies. We evaluated the performance of our model comprehensively, using both artificial data and a "split-study" validation approach that provides an agnostic assessment of the model's behavior not only under the null hypothesis, but also under a realistic alternative. The simulation results from the artificial data demonstrate the advantages of the Bayesian model: its 1 - AUC values are roughly half of the corresponding values for a direct combination of t- and SAM-statistics. Furthermore, the simulations provide guidelines for when the Bayesian model is most likely to be useful. Most notably, in small studies the Bayesian model generally outperforms other methods when evaluated by AUC, FDR, and MDR across a range of simulation parameters, and this difference diminishes for larger sample sizes in the individual studies. The split-study validation illustrates appropriate shrinkage of the Bayesian model in the absence of the platform, sample, and annotation differences that otherwise complicate experimental data analyses. Finally, we fit our model to four breast cancer studies employing different technologies (cDNA and Affymetrix) to estimate differential expression in estrogen receptor positive tumors versus negative ones. Software and data for reproducing our analysis are publicly available.