Department of Biochemistry and Molecular Genetics, University of Virginia Health System, Charlottesville, Virginia, USA.
BMC Bioinformatics. 2010 Jul 23;11:396. doi: 10.1186/1471-2105-11-396.
In the last decade, biochemical studies have revealed that epigenetic modifications, including histone modifications, histone variants and DNA methylation, form a complex network that regulates the state of chromatin and the processes that depend on it, including transcription and DNA replication. Currently, a large number of these epigenetic modifications are being mapped in a variety of cell lines at different stages of development using high-throughput sequencing by members of the ENCODE consortium, the NIH Roadmap Epigenomics Program and the Human Epigenome Project. An extremely promising and underexplored area of research is the application of machine learning methods, which are designed to construct predictive network models, to these large-scale epigenomic data sets.
Using a ChIP-Seq data set of 20 histone lysine and arginine methylations and histone variant H2A.Z in human CD4+ T-cells, we built predictive models of gene expression as a function of histone modification/variant levels using Multilinear (ML) Regression and Multivariate Adaptive Regression Splines (MARS). Along with extensive crosstalk among the 20 histone methylations, we found that H4R3me2 was the most globally repressive histone methylation among the 20 studied in the ML model and the second most globally repressive in the MARS model. In support of our finding, a number of experimental studies show that PRMT5-catalyzed symmetric dimethylation of H4R3 is associated with repression of gene expression. These include a recent study demonstrating that H4R3me2 is required for DNMT3A-mediated DNA methylation, a known global repressor of gene expression.
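The multilinear regression step described above can be sketched as an ordinary least-squares fit of expression on modification levels. This is a minimal illustration on synthetic data, not the authors' pipeline: the mark names (`mark_A`, `mark_B`, `mark_C`), the simulated coefficients, and the helper `fit_multilinear` are all hypothetical, and MARS is not shown.

```python
import numpy as np

def fit_multilinear(levels, expression):
    """Least-squares fit of expression = b0 + sum_j b_j * levels[:, j]."""
    design = np.column_stack([np.ones(len(levels)), levels])
    coef, *_ = np.linalg.lstsq(design, expression, rcond=None)
    return coef[0], coef[1:]  # intercept, per-mark coefficients

# Synthetic stand-in for per-gene modification levels (assumption: 3 marks).
rng = np.random.default_rng(0)
marks = ["mark_A", "mark_B", "mark_C"]  # hypothetical names
n_genes = 2000
levels = rng.normal(size=(n_genes, len(marks)))

# Simulated ground truth: one activating and two repressive marks.
true_beta = np.array([1.0, -0.5, -1.5])
expression = 2.0 + levels @ true_beta + rng.normal(scale=0.1, size=n_genes)

intercept, beta = fit_multilinear(levels, expression)

# The most negative coefficient marks the most globally repressive modification
# in this toy model.
most_repressive = marks[int(np.argmin(beta))]
print(dict(zip(marks, beta.round(2))), most_repressive)
```

Ranking modifications by fitted coefficient is only meaningful when the predictors are on comparable scales, so a real analysis would likely normalize the modification levels first.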
In stark contrast to what univariate analysis of the relationship between H4R3me2 and gene expression levels suggests, our study showed that the regulatory role of some modifications, such as H4R3me2, is masked by confounding variables but can be elucidated by multivariate/systems-level approaches.
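A toy example may help show how a confounder can mask a repressive effect in a univariate analysis. The sketch below uses synthetic data in which a hypothetical repressive mark co-occurs with a hypothetical activating mark; the variable names, effect sizes, and the helper `ols` are assumptions made for illustration and do not come from the study.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes = 5000

# Hypothetical marks: the repressive mark tends to co-occur with the activating one.
activating = rng.normal(size=n_genes)
repressive = 0.9 * activating + rng.normal(scale=0.5, size=n_genes)

# Simulated expression: activating mark raises it, repressive mark lowers it.
expression = (1.0 * activating - 0.5 * repressive
              + rng.normal(scale=0.2, size=n_genes))

def ols(predictors, response):
    """Least-squares slopes (intercept fitted but dropped) for a list of predictors."""
    design = np.column_stack([np.ones(len(response))] + list(predictors))
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef[1:]

# Univariate: the repressive mark alone appears positively associated with expression.
univariate_slope = ols([repressive], expression)[0]

# Multivariate: adjusting for the activating mark recovers the negative effect.
multivariate_slopes = ols([activating, repressive], expression)

print(f"univariate slope:   {univariate_slope:+.2f}")
print(f"multivariate slope: {multivariate_slopes[1]:+.2f}")
```

In this simulation the univariate slope comes out positive while the multivariate coefficient is close to the simulated value of -0.5, which is the kind of masking a joint model is meant to expose.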