College of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China.
Biomed Res Int. 2021 May 29;2021:5515342. doi: 10.1155/2021/5515342. eCollection 2021.
As one of the important epigenetic modifications, DNA N4-methylcytosine (4mC) plays a crucial role in controlling gene replication, expression, cell cycle, DNA replication, and differentiation. Accurate identification of 4mC sites is necessary to understand their biological functions. In this paper, we use ensemble learning to develop a model named i4mC-EL to identify 4mC sites in the mouse genome. First, a multifeature encoding scheme consisting of Kmer and EIIP was adopted to describe the DNA sequences. Second, on the basis of this encoding scheme, we developed a stacked ensemble model in which four machine learning algorithms, namely BayesNet, NaiveBayes, LibSVM, and Voted Perceptron, serve as base classifiers whose intermediate outputs are the input of the metaclassifier, Logistic. Experimental results on the independent test dataset show that the overall prediction accuracy of i4mC-EL is 82.19%, which is better than that of existing methods. The user-friendly website implementing i4mC-EL can be accessed freely at the following.
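To make the multifeature encoding step concrete, the following is a minimal Python sketch of how Kmer and EIIP descriptors could be concatenated for a single DNA sequence. It is an illustration under stated assumptions, not the i4mC-EL implementation: the function names (`kmer_features`, `eiip_features`, `encode`), the default `k=2`, and the per-position form of the EIIP encoding are all assumptions, since the abstract does not specify them. The EIIP values are the ones commonly quoted for the four nucleotides.

```python
from itertools import product

# Electron-ion interaction pseudopotential (EIIP) values commonly quoted
# for the four nucleotides.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def kmer_features(seq, k=2):
    """Normalized frequency of every possible k-mer, in lexicographic order.

    Assumption: k is a free parameter; the abstract does not state which
    value(s) of k the authors used.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing non-ACGT symbols
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def eiip_features(seq):
    """Replace each nucleotide with its EIIP value (0.0 for unknown symbols).

    Assumption: a per-position encoding, giving one value per nucleotide.
    """
    return [EIIP.get(base, 0.0) for base in seq.upper()]


def encode(seq, k=2):
    """Concatenate the Kmer and EIIP descriptors into one feature vector."""
    return kmer_features(seq, k) + eiip_features(seq)
```

For a sequence of length L, `encode` returns a vector with 4^k + L entries. In a stacked setup such as the one described above, a vector of this kind would be passed to each base classifier, and the base classifiers' outputs would then be passed to the metaclassifier.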