Budden David M, Hurley Daniel G, Crampin Edmund J
Systems Biology Laboratory, Melbourne School of Engineering, The University of Melbourne, 3010 Parkville, Australia ; NICTA Victoria Research Laboratory, The University of Melbourne, 3010 Parkville, Australia.
Systems Biology Laboratory, Melbourne School of Engineering, The University of Melbourne, 3010 Parkville, Australia.
Epigenetics Chromatin. 2015 Jun 19;8:21. doi: 10.1186/s13072-015-0013-9. eCollection 2015.
Predictive modelling of gene expression is a powerful framework for the in silico exploration of transcriptional regulatory interactions through the integration of high-throughput -omics data. A major limitation of previous approaches is their inability to handle conditional interactions that emerge when genes are subject to different regulatory mechanisms. Although chromatin immunoprecipitation-based histone modification data are often used as proxies for chromatin accessibility, the association between these variables and expression often depends upon the presence of other epigenetic markers (e.g. DNA methylation or histone variants). These conditional interactions are poorly handled by previous predictive models and reduce the reliability of downstream biological inference.
We have previously demonstrated that integrating both transcription factor and histone modification data within a single predictive model is rendered ineffective by their statistical redundancy. In this study, we evaluate four proposed methods for quantifying gene-level DNA methylation and demonstrate that including these data in predictive modelling frameworks is subject to the same critical limitation in data integration. Based on the hypothesis that statistical redundancy in epigenetic data is caused by conditional regulatory interactions within a dynamic chromatin context, we construct a new gene expression model that is the first to improve prediction accuracy by unsupervised identification of latent regulatory classes. We show that DNA methylation and H2A.Z histone variant data can be interpreted in this way to identify and explore the signatures of silenced and bivalent promoters, substantially improving genome-wide predictions of mRNA transcript abundance and downstream biological inference across multiple cell lines.
Previous models of gene expression have been applied successfully to several important problems in molecular biology, including the discovery of transcription factor roles, identification of regulatory elements responsible for differential expression patterns and comparative analysis of the transcriptome across distant species. Our analysis supports our hypothesis that statistical redundancy in epigenetic data is partially due to conditional relationships between these regulators and gene expression levels. This analysis provides insight into the heterogeneous roles of H3K4me3 and H3K27me3 in the presence of the H2A.Z histone variant (implicated in cancer progression) and how these signatures change during lineage commitment and carcinogenesis.
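The idea of class-conditional prediction summarised above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the clustering or regression method, so the sketch assumes a two-class 1-D k-means on a methylation-like signal followed by a separate least-squares line per class. All data are simulated, and every name (`simulate`, `two_means_1d`, `fit_line`, the `meth`/`k4`/`expr` variables) is hypothetical.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    return my - b * mx, b


def r_squared(ys, preds):
    """Coefficient of determination of preds against ys."""
    my = mean(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def two_means_1d(values, iters=50):
    """1-D k-means with k=2, seeded at the min and max; returns 0/1 labels."""
    lo, hi = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        g0 = [v for v, lab in zip(values, labels) if lab == 0]
        g1 = [v for v, lab in zip(values, labels) if lab == 1]
        if not g0 or not g1:  # degenerate input (all values equal)
            break
        new_lo, new_hi = mean(g0), mean(g1)
        if (new_lo, new_hi) == (lo, hi):  # centres stopped moving
            break
        lo, hi = new_lo, new_hi
    return labels


def simulate(n=400, seed=0):
    """Simulated genes: (methylation-like signal, histone-like signal, expression).

    Assumption for illustration only: highly methylated promoters are
    'silenced', so the histone signal is only weakly related to expression.
    """
    rng = random.Random(seed)
    genes = []
    for _ in range(n):
        silenced = rng.random() < 0.5
        meth = rng.gauss(0.8 if silenced else 0.2, 0.05)
        k4 = rng.uniform(0, 10)
        slope = 0.1 if silenced else 1.0
        genes.append((meth, k4, slope * k4 + rng.gauss(0, 0.5)))
    return genes


genes = simulate()
meth = [g[0] for g in genes]
k4 = [g[1] for g in genes]
expr = [g[2] for g in genes]

# Single model fitted to all genes, ignoring the latent class.
a, b = fit_line(k4, expr)
r2_global = r_squared(expr, [a + b * x for x in k4])

# Unsupervised class assignment, then one regression per class.
labels = two_means_1d(meth)
preds = [0.0] * len(genes)
for c in (0, 1):
    idx = [i for i, lab in enumerate(labels) if lab == c]
    a_c, b_c = fit_line([k4[i] for i in idx], [expr[i] for i in idx])
    for i in idx:
        preds[i] = a_c + b_c * k4[i]
r2_classes = r_squared(expr, preds)

print(f"single model R^2:      {r2_global:.3f}")
print(f"class-conditional R^2: {r2_classes:.3f}")
```

Both scores here are computed on the training data, where fitting one line per class can never do worse than a single line. A fair comparison on real data would evaluate on held-out genes, for example by cross-validation.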