Center for Precision Health, School of Biomedical Informatics.
Brief Bioinform. 2021 May 20;22(3). doi: 10.1093/bib/bbaa099.
DNA N4-methylcytosine (4mC) modification represents a novel form of epigenetic regulation. It is involved in various cellular processes, including DNA replication, the cell cycle and gene expression. As a complement to experimental identification, in silico prediction of 4mC sites in the genome has emerged as a promising alternative. In this study, we first reviewed current progress in the computational prediction of 4mC sites and systematically evaluated, in six species, the predictive capacity of eight conventional machine learning algorithms and 12 feature types commonly used in previous studies. Using a representative benchmark dataset, we investigated how feature selection and a stacking approach contribute to model construction, and found that feature optimization and proper reinforcement learning could improve performance. We next collected newly added 4mC sites in the genomes of the six species and developed a novel deep learning-based 4mC site predictor, Deep4mC. Deep4mC applies convolutional neural networks to four representative features. For species with small numbers of samples, we extended our deep learning framework with a bootstrapping method. Our evaluation indicated that Deep4mC achieved high accuracy and robust performance, with average area under the curve (AUC) values greater than 0.9 in all species (range: 0.9005-0.9722). Compared with previous tools, Deep4mC improved the AUC value by 10.14% to 46.21% in these six species. A user-friendly web server (https://bioinfo.uth.edu/Deep4mC) was built for predicting putative 4mC sites in a genome.
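The evaluation metric and the resampling idea mentioned above can be illustrated with a minimal Python sketch. It is not the Deep4mC implementation: the function names (`one_hot`, `auc`, `bootstrap_sample`) are hypothetical, the one-hot encoding is only an assumed example of a sequence feature (the abstract does not name the four features used), and the resampling shown is plain sampling with replacement, which may differ from the bootstrapping method used in the study.

```python
import random


def one_hot(seq):
    """Encode a DNA sequence as a list of 4-element vectors (A, C, G, T).

    Assumed illustrative feature; unknown bases map to all zeros.
    """
    table = {
        "A": [1, 0, 0, 0],
        "C": [0, 1, 0, 0],
        "G": [0, 0, 1, 0],
        "T": [0, 0, 0, 1],
    }
    return [table.get(base, [0, 0, 0, 0]) for base in seq.upper()]


def auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive scores higher than a randomly chosen negative
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_sample(items, size, rng):
    """Draw `size` items with replacement (a generic bootstrap resample)."""
    return [rng.choice(items) for _ in range(size)]


if __name__ == "__main__":
    rng = random.Random(0)
    print(one_hot("ACGT"))
    print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
    print(bootstrap_sample(["site1", "site2", "site3"], 5, rng))
```

The pairwise AUC here is quadratic in the number of samples, which is fine for illustration; a rank-based implementation would be preferable for genome-scale data.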