Gao Shang, Rehman Jalees, Dai Yang
Department of Biomedical Engineering, University of Illinois at Chicago, Chicago, IL, USA.
Department of Medicine, Division of Cardiology, University of Illinois at Chicago, Chicago, IL, USA.
Comput Struct Biotechnol J. 2022 Jul 13;20:3814-3823. doi: 10.1016/j.csbj.2022.07.014. eCollection 2022.
Gene expression is regulated at both transcriptional and post-transcriptional levels. DNA sequence and epigenetic modifications are key factors that regulate gene transcription. Understanding their complex interactions and their respective contributions to gene expression regulation remains a challenge in biological studies. We have developed iSEGnet, a deep convolutional neural network framework that predicts mRNA abundance from DNA sequence information and epigenetic modifications within genes and their cis-regulatory regions. We demonstrate that our framework outperforms other machine learning models in predicting mRNA abundance, using transcriptional and epigenetic profiles from six distinct cell lines/types chosen from ENCODE. Analysis of the learned models also reveals that specific regions around promoters and transcription termination sites are most important for gene expression regulation. Using the method of Integrated Gradients, we identify, for a specific epigenetic modification, narrow segments in these regions that are most likely to impact gene expression. We further show that these identified segments are enriched in known active regulatory regions by comparing them with transcription factor binding sites obtained via ChIP-seq. Moreover, we demonstrate how iSEGnet can uncover potential transcription factors that have regulatory functions in cancer using two cancer multi-omics datasets.
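The abstract names Integrated Gradients as the attribution method but gives no computational detail. The following is a minimal sketch of the general technique only, not the iSEGnet implementation. The toy model (`model`, `model_grad`, weights `W`), the input layout (4 one-hot DNA channels plus 1 epigenetic signal channel over 50 positions), the all-zero baseline, and the step count are all assumptions made for illustration. Attributions are approximated with a midpoint Riemann sum along the straight path from the baseline to the input.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=200):
    """Approximate Integrated Gradients with a midpoint Riemann sum.

    grad_fn returns the gradient of the model output with respect to its input.
    """
    alphas = (np.arange(steps) + 0.5) / steps  # midpoints in (0, 1)
    total = np.zeros_like(x, dtype=float)
    for a in alphas:
        # Gradient at a point on the straight path from baseline to x
        total += grad_fn(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

# Hypothetical stand-in for a trained network: a smooth scalar function
# with an analytic gradient (assumption; a real model would use autodiff).
rng = np.random.default_rng(0)
n_pos, n_chan = 50, 5  # 4 one-hot DNA channels + 1 epigenetic signal channel
W = rng.normal(scale=0.1, size=(n_pos, n_chan))

def model(x):
    return np.tanh(np.sum(W * x))

def model_grad(x):
    return (1.0 - np.tanh(np.sum(W * x)) ** 2) * W

# Synthetic input: random one-hot sequence plus a random signal track
seq = rng.integers(0, 4, size=n_pos)
x = np.zeros((n_pos, n_chan))
x[np.arange(n_pos), seq] = 1.0
x[:, 4] = rng.random(n_pos)
baseline = np.zeros_like(x)  # assumed all-zero reference input

attr = integrated_gradients(model_grad, x, baseline)

# Aggregate absolute attributions per position and rank positions
per_position = np.abs(attr).sum(axis=1)
top_positions = np.argsort(per_position)[::-1][:5]
```

A useful sanity check is the completeness property: the attributions should sum approximately to `model(x) - model(baseline)`. Ranking positions by aggregated attribution, as in `top_positions`, is one simple way to pick out narrow high-impact segments; the criterion actually used in the paper is not stated in the abstract.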