Li Qiwei, Cassese Alberto, Guindani Michele, Vannucci Marina
Department of Clinical Sciences, University of Texas Southwestern Medical Center, Dallas, Texas, U.S.A.
Department of Methodology and Statistics, Faculty of Psychology and Neuroscience, Maastricht University, Maastricht, The Netherlands.
Biometrics. 2019 Mar;75(1):183-192. doi: 10.1111/biom.12962. Epub 2018 Sep 19.
In this article, we develop a Bayesian hierarchical mixture regression model for studying the association between a multivariate response, measured as counts on a set of features, and a set of covariates. We have available RNA-Seq and DNA methylation data measured on breast cancer patients at different stages of the disease. We account for the heterogeneity and over-dispersion of count data (here, RNA-Seq data) by considering a mixture of negative binomial distributions, and we incorporate the covariates (here, methylation data) into the model via a linear modeling construction on the mean components. Our modeling construction includes several innovative characteristics. First, it employs selection techniques that allow the identification of a small subset of features that best discriminate the samples while simultaneously selecting a set of covariates associated with each feature. Second, it incorporates known dependencies into the feature selection process via the use of Markov random field (MRF) priors. On simulated data, we show how incorporating existing information via the prior model can improve the accuracy of feature selection. In the analysis of RNA-Seq and DNA methylation data on breast cancer, we incorporate knowledge of relationships among genes via a gene-gene network, which we extract from the KEGG database. Our data analysis identifies genes that discriminate among cancer stages and simultaneously selects significant associations between those genes and DNA methylation sites. A biological interpretation of our findings reveals several biomarkers that can help in understanding the effect of DNA methylation on gene expression transcription across cancer stages.
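The abstract names three building blocks without giving formulas: a negative binomial likelihood for the counts, a linear construction on the mean components, and an MRF prior on the feature selection indicators. The sketch below shows one common way each piece can be written. It is not the authors' implementation. The log link, the mean/dispersion parametrization, the Ising-type prior form, and every function and parameter name (`nb_logpmf`, `log_mean`, `mrf_log_prior`, `size_factor`, `group_effect`, `d`, `f`) are assumptions made here for illustration only.

```python
import math


def nb_logpmf(y, mu, phi):
    """Log-pmf of a negative binomial with mean mu and dispersion phi.

    Assumed parametrization: variance = mu + mu**2 / phi, so smaller phi
    means stronger over-dispersion relative to a Poisson.
    """
    return (math.lgamma(y + phi) - math.lgamma(phi) - math.lgamma(y + 1)
            + phi * math.log(phi / (phi + mu))
            + y * math.log(mu / (phi + mu)))


def log_mean(size_factor, baseline, group_effect, x, beta):
    """Assumed log-linear mean for one sample and one feature.

    log mu = log(size_factor) + baseline + group_effect + x . beta
    where x holds the covariates (e.g. methylation values) and beta the
    regression coefficients; a zero entry in beta means that covariate
    is not selected for this feature.
    """
    linear = sum(xi * bi for xi, bi in zip(x, beta))
    return math.log(size_factor) + baseline + group_effect + linear


def mrf_log_prior(gamma, edges, d, f):
    """Unnormalized log-probability of an Ising-type MRF prior (assumed form).

    gamma: list of 0/1 indicators, 1 if a feature is discriminatory.
    edges: pairs (j, k) of connected features in the known network.
    d: sparsity parameter (negative values favor few selected features).
    f: smoothing parameter (positive values favor selecting neighbors together).
    """
    return d * sum(gamma) + f * sum(gamma[j] * gamma[k] for j, k in edges)


# Usage: two connected features selected together receive a higher prior
# log-probability than the same selection with no network information.
with_network = mrf_log_prior([1, 1, 0], [(0, 1), (1, 2)], d=-2.0, f=0.5)
without_network = mrf_log_prior([1, 1, 0], [], d=-2.0, f=0.5)
```

Under these assumptions, the network term only changes the prior when both endpoints of an edge are selected, which is how known gene-gene relationships could encourage joint selection of connected features.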