China Agricultural University, Beijing.
Department of Physiology, Ajou University School of Medicine, Republic of Korea.
Brief Bioinform. 2021 May 20;22(3). doi: 10.1093/bib/bbaa202.
DNA N6-methyladenine (6mA) is an important epigenetic modification involved in various cellular processes. Accurate identification of 6mA sites is a challenging task in genome analysis and supports the understanding of their biological functions. To date, several species-specific machine learning (ML)-based models have been proposed, but the majority of them have not been tested on other species. Hence, their practical application to other plant species is quite limited. In this study, we explored 10 different feature encoding schemes, with the goal of capturing key characteristics around 6mA sites. We selected five feature encoding schemes, based on physicochemical and position-specific information, that possess high discriminative capability. The resulting feature sets were input to six commonly used ML methods (random forest, support vector machine, extremely randomized tree, logistic regression, naïve Bayes and AdaBoost). The Rosaceae genome was used to train these classifiers, generating 30 baseline models (five encodings by six classifiers). To integrate their individual strengths, we propose Meta-i6mA, which combines the baseline models using a meta-predictor approach. In extensive independent tests, Meta-i6mA achieved high Matthews correlation coefficient values of 0.918, 0.827 and 0.635 on Rosaceae, rice and Arabidopsis thaliana, respectively, and outperformed the existing predictors. We anticipate that Meta-i6mA can be applied across different plant species. Furthermore, we developed a user-friendly online web server, which is available at http://kurata14.bio.kyutech.ac.jp/Meta-i6mA/.
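As a rough illustration of two quantities mentioned in the abstract, the sketch below enumerates a 5 by 6 grid of (encoding, classifier) pairs, which gives the 30 baseline models, and computes the Matthews correlation coefficient for binary predictions. It is not the authors' implementation: the classifier labels follow the abstract, but the encoding names are hypothetical placeholders (the abstract does not name the five selected schemes), and the function names `baseline_grid` and `matthews_corrcoef` are assumptions made for this example.

```python
from itertools import product
from math import sqrt

# Classifier labels follow the six methods listed in the abstract.
CLASSIFIERS = ["RF", "SVM", "ERT", "LR", "NB", "AdaBoost"]
# Hypothetical placeholder names: the abstract does not list the five encodings.
ENCODINGS = ["enc_1", "enc_2", "enc_3", "enc_4", "enc_5"]


def baseline_grid(encodings, classifiers):
    """Return every (encoding, classifier) pair, one per baseline model."""
    return list(product(encodings, classifiers))


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = 6mA, 0 = non-6mA)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when a marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    print(len(baseline_grid(ENCODINGS, CLASSIFIERS)))
    print(matthews_corrcoef([1, 1, 0, 0], [1, 0, 0, 0]))
```

In a meta-predictor setup of this general kind, the outputs of the baseline models are typically used as input features for a second-level classifier; the abstract does not specify how Meta-i6mA does this, so that step is left out of the sketch.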