Department of Biostatistics, University of Florida, Gainesville, FL 32603, United States.
Department of Human Genetics, Emory University School of Medicine, Atlanta, GA 30322, United States.
Bioinformatics. 2024 Sep 2;40(9). doi: 10.1093/bioinformatics/btae528.
5-Hydroxymethylcytosine (5hmC), a crucial epigenetic mark with a significant role in regulating tissue-specific gene expression, is essential for understanding the dynamic functions of the human genome. Despite its importance, predicting 5hmC modification across the genome remains a challenging task, especially when considering the complex interplay between DNA sequences and various epigenetic factors such as histone modifications and chromatin accessibility.
Using tissue-specific 5hmC sequencing data, we introduce Deep5hmC, a multimodal deep learning framework that integrates the DNA sequence with epigenetic features, such as histone modification and chromatin accessibility, to predict genome-wide 5hmC modification. The multimodal design of Deep5hmC substantially improves prediction of both qualitative (binary) and quantitative (continuous) 5hmC modification compared with unimodal versions of Deep5hmC and state-of-the-art machine learning methods. We demonstrate this improvement by benchmarking on a comprehensive set of 5hmC sequencing data collected at four developmental stages during forebrain organoid development and across 17 human tissues. For predicting binary 5hmC modification sites, Deep5hmC improves the Area Under the Receiver Operating Characteristic curve (AUROC) over DeepSEA and random forest by close to 4% and 17%, respectively, across the four forebrain developmental stages, and by 6% and 27% across the 17 human tissues. For predicting continuous 5hmC modification, it improves the Spearman correlation coefficient over DeepSEA and random forest by 8% and 22%, respectively, across the four forebrain developmental stages, and by 17% and 30% across the 17 human tissues. Notably, Deep5hmC shows its practical utility by accurately predicting gene expression and identifying differentially hydroxymethylated regions (DhMRs) in a case-control study of Alzheimer's disease (AD). Deep5hmC significantly improves our understanding of tissue-specific gene regulation and facilitates the development of new biomarkers for complex diseases.
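For readers unfamiliar with the two benchmark metrics named above, the following is a minimal sketch of how AUROC (for binary 5hmC sites) and the Spearman correlation coefficient (for continuous 5hmC levels) can be computed from predictions. It uses only the Python standard library. The function names and the toy inputs are illustrative assumptions and are not taken from the Deep5hmC code base, which may compute these metrics differently.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic (labels are 0 or 1)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")
    ranks = average_ranks(scores)
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u_stat / (n_pos * n_neg)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("Spearman correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Hypothetical toy data: 1 = 5hmC peak, 0 = background region.
    toy_labels = [0, 0, 1, 1]
    toy_scores = [0.1, 0.4, 0.35, 0.8]
    print("AUROC:", auroc(toy_labels, toy_scores))

    # Hypothetical observed vs. predicted continuous 5hmC signal.
    observed = [1.0, 2.0, 3.0, 4.0]
    predicted = [1.0, 4.0, 9.0, 16.0]
    print("Spearman:", spearman(observed, predicted))
```

Both metrics depend only on the ranking of the predictions, not on their scale, so they allow models with differently calibrated outputs to be compared on the same footing.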
Deep5hmC is available via https://github.com/lichen-lab/Deep5hmC.