La Jolla Institute for Immunology, 9420 Athena Circle, La Jolla, CA, 92037, USA.
Bioinformatics and Systems Biology Graduate Program, University of California San Diego, La Jolla, CA, 92093, USA.
Genome Biol. 2024 Jun 3;25(1):142. doi: 10.1186/s13059-024-03273-z.
Like its parent base 5-methylcytosine (5mC), 5-hydroxymethylcytosine (5hmC) is a direct epigenetic modification of cytosines in the context of CpG dinucleotides. 5hmC is the most abundant oxidized form of 5mC, generated through the action of TET dioxygenases at gene bodies of actively transcribed genes and at active or lineage-specific enhancers. Although 5hmC enrichment at these sites has been reported, no predictive models of gene expression state or of putative gene regulatory regions based on 5hmC have been developed to date.
Here, using only 5hmC enrichment in genic regions and their vicinity, we develop neural network models that predict gene expression state across 49 cell types. Our deep neural network models distinguish high from low expression states using 5hmC levels alone, and they generalize to unseen cell types. To leverage 5hmC signal in distal enhancers for expression prediction, we further employ an Activity-by-Contact model and develop a graph convolutional neural network model, both of which use Hi-C data and 5hmC enrichment to prioritize enhancer-promoter links. These approaches identify known and novel putative enhancers for key genes in multiple immune cell subsets.
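As a rough illustration of the Activity-by-Contact idea mentioned above, the following minimal sketch scores each candidate enhancer for one gene as its activity multiplied by its contact frequency with the promoter, normalized over all candidates for that gene. It assumes that activity is a 5hmC enrichment value and that contact is a Hi-C contact frequency; the function name `abc_scores`, the input layout, and the toy numbers are hypothetical and are not taken from the paper, whose exact scoring may differ.

```python
from typing import Dict, List, Tuple


def abc_scores(candidates: List[Tuple[str, float, float]]) -> Dict[str, float]:
    """Score candidate enhancers for a single gene.

    Each candidate is (enhancer_id, activity, contact), where activity is
    assumed to be a 5hmC enrichment value and contact a Hi-C contact
    frequency with the gene's promoter (both hypothetical inputs).
    """
    # Activity x Contact for every candidate element.
    products = {eid: activity * contact for eid, activity, contact in candidates}
    total = sum(products.values())
    if total == 0:
        # No signal at all: every candidate gets a score of zero.
        return {eid: 0.0 for eid in products}
    # Normalize so the scores for one gene sum to 1.
    return {eid: value / total for eid, value in products.items()}


# Toy input: three candidate enhancers for one gene (made-up numbers).
scores = abc_scores([("e1", 4.0, 0.5), ("e2", 1.0, 1.0), ("e3", 2.0, 0.5)])
top_enhancer = max(scores, key=scores.get)
```

In this toy input, the element with high activity and moderate contact ranks first, which is the behavior that lets such a score prioritize enhancer-promoter links.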
Our work highlights the importance of 5hmC in gene regulation through proximal and distal mechanisms and provides a framework to link it to genome function. With recent advances in 6-letter DNA sequencing by short- and long-read techniques, profiling of 5mC and 5hmC may become routine in the near future, opening a broad range of applications for the methods developed here.