Khanal Jhabindra, Lim Dae Young, Tayara Hilal, Chong Kil To
Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea.
Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea; Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, South Korea.
Genomics. 2021 Jan;113(1 Pt 2):582-592. doi: 10.1016/j.ygeno.2020.09.054. Epub 2020 Oct 1.
DNA N6-methyladenine (6mA) is an epigenetic modification that plays a vital role in a variety of cellular processes in both eukaryotes and prokaryotes. Accurate information on 6mA sites in the Rosaceae genome may help in understanding genomic 6mA distributions and biological functions such as epigenetic inheritance. Studies have shown that 6mA sites can be identified experimentally, but the procedures are time-consuming and costly. To overcome these drawbacks, we propose a computational approach based on machine learning (ML) to identify 6mA sites in Rosa chinensis (R. chinensis) and Fragaria vesca (F. vesca). To improve the performance of the proposed model and to avoid overfitting, a recursive feature elimination with cross-validation (RFECV) strategy is used to select the optimal number of features (ONF) subset from five DNA sequence encoding schemes: Binary Encoding (BE), Ring-Function-Hydrogen-Chemical Properties (RFHC), Electron-Ion-Interaction Pseudo Potentials of Nucleotides (EIIP), Dinucleotide Physicochemical Properties (DPCP), and Trinucleotide Physicochemical Properties (TPCP). We then use the ONF subset to train a two-layer ML-based stacking model, yielding a bioinformatics tool named 'i6mA-stack'. This tool generally outperforms its peer tool and is currently available at http://nsclbio.jbnu.ac.kr/tools/i6mA-stack/.
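As an illustration of two of the encoding schemes named above, the following minimal Python sketch converts a DNA sequence into a BE and an EIIP feature vector. It is not the authors' implementation: the one-hot bit order, the zero-fill handling of unknown bases, and the function names (`encode_binary`, `encode_eiip`, `encode`) are assumptions made for this example. The EIIP constants are the values commonly cited for the four nucleotides.

```python
from typing import Dict, List

# Binary Encoding (BE) as a one-hot vector per nucleotide.
# The bit order A, C, G, T is an assumption of this sketch.
BINARY: Dict[str, List[float]] = {
    "A": [1.0, 0.0, 0.0, 0.0],
    "C": [0.0, 1.0, 0.0, 0.0],
    "G": [0.0, 0.0, 1.0, 0.0],
    "T": [0.0, 0.0, 0.0, 1.0],
}

# Commonly cited electron-ion interaction pseudopotential (EIIP) values.
EIIP: Dict[str, float] = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def encode_binary(seq: str) -> List[float]:
    """Concatenate a 4-bit one-hot vector for every base in seq.

    Unknown bases (e.g. 'N') become all zeros; this is an assumption.
    """
    features: List[float] = []
    for base in seq.upper():
        features.extend(BINARY.get(base, [0.0, 0.0, 0.0, 0.0]))
    return features


def encode_eiip(seq: str) -> List[float]:
    """Map every base in seq to its EIIP value (0.0 for unknown bases)."""
    return [EIIP.get(base, 0.0) for base in seq.upper()]


def encode(seq: str) -> List[float]:
    """Concatenate the BE and EIIP features into one vector."""
    return encode_binary(seq) + encode_eiip(seq)


if __name__ == "__main__":
    vector = encode("ACGT")
    print(len(vector))  # 4 bases * 4 bits + 4 EIIP values
```

Vectors built this way, together with the other three encodings, would form the feature matrix from which RFECV selects a subset. In a scikit-learn workflow this could be done with `sklearn.feature_selection.RFECV` followed by `sklearn.ensemble.StackingClassifier`, though the abstract does not say which library or base learners the authors used.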