Hasan Md Mehedi, Manavalan Balachandran, Shoombuatong Watshara, Khatun Mst Shamima, Kurata Hiroyuki
Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan.
Japan Society for the Promotion of Science, 5-3-1 Kojimachi, Chiyoda-ku, Tokyo 102-0083, Japan.
Comput Struct Biotechnol J. 2020 Apr 8;18:906-912. doi: 10.1016/j.csbj.2020.04.001. eCollection 2020.
4-methylcytosine (4mC) is one of the most important DNA modifications and is involved in regulating cell differentiation and gene expression. Accurate identification of 4mC sites is necessary for understanding its various biological functions. In this work, we developed a new computational predictor, i4mC-Mouse, to identify 4mC sites in the mouse genome. We explored six encoding schemes that cover different characteristics of DNA sequence information: k-space nucleotide composition (KSNC), k-mer nucleotide composition (Kmer), mono nucleotide binary encoding (MBE), dinucleotide binary encoding, electron-ion interaction pseudo potentials (EIIP), and dinucleotide physicochemical composition. We then built six random forest (RF)-based models, one per encoding, and linearly combined their probability scores to construct the final predictor. Of the six RF-based models, four encodings (Kmer, KSNC, MBE, and EIIP) were sufficient, contributing 10%, 45%, 25%, and 20% of the prediction performance, respectively. On the independent test, i4mC-Mouse predicted 4mC sites with an accuracy of 0.816 and a Matthews correlation coefficient (MCC) of 0.633, approximately 2.5% and 5% higher, respectively, than those of the existing method (4mCpred-EL). For experimental biologists, a freely available web application was implemented at http://kurata14.bio.kyutech.ac.jp/i4mC-Mouse/.
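The linear combination of per-encoding probability scores can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the reported contributions (10%, 45%, 25%, 20%) act directly as combination weights, and the function names, the one-hot bit order, the choice of k = 2, and the 0.5 decision threshold are all hypothetical. The RF models themselves are omitted; their outputs are represented by a plain dictionary of probabilities.

```python
from itertools import product

# Contributions reported in the abstract, used here as weights (an assumption).
WEIGHTS = {"Kmer": 0.10, "KSNC": 0.45, "MBE": 0.25, "EIIP": 0.20}

# Hypothetical one-hot mapping; the bit order used in the paper is not stated here.
ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}


def mbe_encode(seq):
    """Mono nucleotide binary encoding: four bits per position (A/C/G/T only)."""
    return [bit for base in seq.upper() for bit in ONE_HOT[base]]


def kmer_encode(seq, k=2):
    """Normalized k-mer frequencies over all 4**k k-mers (A/C/G/T only, len(seq) >= k)."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1  # number of overlapping windows
    for i in range(total):
        counts[seq[i:i + k]] += 1
    return [counts[m] / total for m in kmers]


def combine_scores(scores, weights=WEIGHTS):
    """Weighted linear combination of per-encoding probability scores."""
    return sum(weights[name] * scores[name] for name in weights)


def predict_4mc(scores, threshold=0.5):
    """Call a site 4mC when the combined score reaches the (assumed) threshold."""
    return combine_scores(scores) >= threshold


# Made-up probabilities standing in for the outputs of four trained RF models.
example_scores = {"Kmer": 0.6, "KSNC": 0.8, "MBE": 0.4, "EIIP": 0.5}
combined = combine_scores(example_scores)
```

Because the four weights sum to 1, the combined score stays in the [0, 1] range whenever each model's score does, so it can be thresholded like a single probability.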