Kim Seong Gon, Theera-Ampornpunt Nawanol, Fang Chih-Hao, Harwani Mrudul, Grama Ananth, Chaterji Somali
Department of Computer Science, Purdue University, West Lafayette, IN, USA.
BMC Syst Biol. 2016 Aug 1;10 Suppl 2(Suppl 2):54. doi: 10.1186/s12918-016-0302-3.
Gene expression is mediated by specialized cis-regulatory modules (CRMs), the most prominent of which are called enhancers. Early experiments indicated that enhancers located far from gene promoters are often responsible for mediating gene transcription. Knowing their properties, regulatory activity, and genomic targets is crucial to a functional understanding of cellular events, ranging from cellular homeostasis to differentiation. Recent genome-wide investigations of epigenomic marks have indicated that enhancer elements can be enriched for certain epigenomic marks, such as combinatorial patterns of histone modifications.
Our efforts in this paper are motivated by these recent advances in epigenomic profiling methods, which have uncovered enhancer-associated chromatin features in different cell types and organisms. Specifically, we use recent state-of-the-art deep learning methods to develop a deep neural network (DNN)-based architecture, called EP-DNN, that predicts the presence and types of enhancers in the human genome. As features, it uses the expression levels of the histone modifications at the peaks of the functional sites and in their adjacent regions. We apply EP-DNN to four different cell types: H1, IMR90, HepG2, and HeLa S3. We train EP-DNN using p300 binding sites as enhancers, and transcription start sites (TSS) and random non-DHS sites (sites outside DNase I hypersensitive sites) as non-enhancers. We use the EP-DNN predictions to quantify the validation rate at different levels of prediction confidence, and we compare EP-DNN against two state-of-the-art computational models for enhancer prediction, DEEP-ENCODE and RFECS.
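To make the feature and label construction described above concrete, the following is a minimal sketch, not the authors' pipeline. It assumes each histone modification is available as a binned signal track (a list of floats), that a site is identified by a bin index, and that bins outside the track are padded with zero. The function names, the `flank` parameter, the padding rule, and the toy signal values are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def window_features(tracks: Dict[str, Sequence[float]],
                    center_bin: int, flank: int) -> List[float]:
    """Concatenate each mark's signal over bins [center - flank, center + flank]."""
    features: List[float] = []
    for mark in sorted(tracks):  # fixed mark order keeps feature columns stable
        signal = tracks[mark]
        for b in range(center_bin - flank, center_bin + flank + 1):
            # Assumption: bins that fall outside the track are padded with 0.0.
            features.append(signal[b] if 0 <= b < len(signal) else 0.0)
    return features


def build_training_set(tracks: Dict[str, Sequence[float]],
                       enhancer_bins: Sequence[int],
                       non_enhancer_bins: Sequence[int],
                       flank: int) -> Tuple[List[List[float]], List[int]]:
    """Label p300-like sites as 1 and TSS / random background sites as 0."""
    X: List[List[float]] = []
    y: List[int] = []
    for b in enhancer_bins:
        X.append(window_features(tracks, b, flank))
        y.append(1)
    for b in non_enhancer_bins:
        X.append(window_features(tracks, b, flank))
        y.append(0)
    return X, y


# Toy signal values, invented purely for illustration.
tracks = {
    "H3K4me1": [0.1, 0.5, 2.0, 0.4, 0.0, 0.1],
    "H3K27ac": [0.0, 0.3, 1.5, 0.2, 0.1, 0.0],
}
X, y = build_training_set(tracks, enhancer_bins=[2],
                          non_enhancer_bins=[5], flank=1)
```

Each row of `X` holds one window per mark, so the same matrix could be passed to any classifier; a larger `flank` corresponds to including more of the region adjacent to the site.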
We find that EP-DNN is more accurate than DEEP-ENCODE and RFECS and takes less time to make predictions. Next, we develop methods to make EP-DNN interpretable by computing the importance of each input feature in the classification task. This analysis indicates that the important histone modifications differ between cell types, with some overlap: for example, H3K27ac is important in cell type H1 but less so in HeLa S3, while H3K4me1 is relatively important in all four cell types. Finally, we use the feature importance analysis to reduce the number of input features needed to train the DNN, thus reducing training time, which is often the computational bottleneck in the use of a DNN.
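The abstract does not specify how feature importance is computed, so the sketch below substitutes a generic technique, permutation importance, to illustrate the idea of ranking input features and keeping only the top ones. It is not the method used for EP-DNN. The `toy_predict` function is a hypothetical stand-in for a trained network, and the data are synthetic.

```python
import random
from typing import Callable, List, Sequence


def accuracy(predict: Callable[[Sequence[float]], int],
             X: Sequence[Sequence[float]], y: Sequence[int]) -> float:
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict: Callable[[Sequence[float]], int],
                           X: List[List[float]], y: List[int],
                           repeats: int = 20, seed: int = 0) -> List[float]:
    """Mean drop in accuracy when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances: List[float] = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the labels
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / repeats)
    return importances


def toy_predict(row: Sequence[float]) -> int:
    # Hypothetical stand-in for a trained DNN: only feature 0 can flip the output.
    return 1 if 2.0 * row[0] + 0.0 * row[1] + 0.5 * row[2] > 1.0 else 0


# Synthetic data; labels come from the toy model itself.
X = [[float(i % 2), float(i % 3), (i % 4) / 4.0] for i in range(8)]
y = [toy_predict(row) for row in X]

importances = permutation_importance(toy_predict, X, y)
# Keep the single most important feature, e.g. to retrain on a reduced input.
top_features = sorted(range(len(importances)), key=lambda j: -importances[j])[:1]
```

Features whose shuffling does not change accuracy receive an importance of zero and are natural candidates for removal before retraining, which is the mechanism by which a reduced feature set can shorten training time.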
In this paper, we developed EP-DNN, which achieves high prediction accuracy, with validation rates above 90% in the operational region of enhancer prediction for all four cell lines that we studied, outperforming DEEP-ENCODE and RFECS. We then developed a method to analyze a trained DNN and determine which histone modifications are important and, within each modification, which features, proximal or distal to the enhancer site, are important.