Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand.
Department of Physiology, Ajou University School of Medicine, Suwon 16499, Republic of Korea; Department of Molecular Science and Technology, Ajou University, Suwon 16499, Republic of Korea.
J Mol Biol. 2022 Jun 15;434(11):167549. doi: 10.1016/j.jmb.2022.167549. Epub 2022 Mar 16.
N7-methylguanosine (m7G) is an essential, ubiquitous, and positively charged modification at the 5' cap of eukaryotic mRNA that modulates its export, translation, and splicing. Although several machine learning (ML)-based computational predictors for m7G have been developed, each relied on a single, specific computational framework. This study is the first to explore four different computational frameworks and identify the best approach. Building on that comparison, we developed a novel predictor, THRONE (a three-layer ensemble predictor for identifying human RNA N7-methylguanosine sites), to accurately identify m7G sites from the human genome. THRONE feeds a wide range of sequence-based features into several ML classifiers and combines the resulting models through ensemble learning. The three-step ensemble learning proceeds as follows: in the first layer, 54 baseline models were constructed, and their predicted probabilities of m7G were treated as a new feature vector for the subsequent step. Next, six meta-models were built on this new feature vector, and their predicted probabilities were in turn treated as novel features. Finally, a systematic comparison on these novel features identified random forest as the best super learner for the final prediction. THRONE outperformed existing methods in predicting m7G sites in both cross-validation analysis and independent evaluation. The proposed method is publicly accessible at http://thegleelab.org/THRONE/ and is expected to help the scientific community identify putative m7G sites and formulate novel, testable biological hypotheses.
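The data flow of the three-layer ensemble described above can be illustrated with a minimal, pure-Python sketch. It is not the THRONE implementation: the k-mer encodings, the toy G-rich labelling rule, the small gradient-descent logistic learner, the use of out-of-fold probabilities between layers, and all function names are assumptions made for illustration. The sketch also uses 6 baseline models and 2 meta-models rather than the 54 and six reported in the abstract, and a logistic stand-in where the abstract reports a random forest.

```python
import math
import random

def kmer_freqs(seq, k):
    """Frequency of every k-mer over ACGU (an illustrative sequence encoding)."""
    kmers = [""]
    for _ in range(k):
        kmers = [p + c for p in kmers for c in "ACGU"]
    n = len(seq) - k + 1
    return [sum(seq[i:i + k] == m for i in range(n)) / max(n, 1) for m in kmers]

def predict_proba(w, x):
    """Probability of the positive class under logistic weights w (w[0] is the bias)."""
    z = w[0] + sum(wj * v for wj, v in zip(w[1:], x))
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, epochs=100, lr=0.5):
    """Stochastic gradient descent logistic regression (stand-in for any classifier)."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            g = predict_proba(w, xi) - yi
            w[0] -= lr * g
            for j, v in enumerate(xi):
                w[j + 1] -= lr * g * v
    return w

def out_of_fold(X, y, epochs, lr, folds=5):
    """Score each sample with a model that was trained without that sample."""
    probs = [0.0] * len(X)
    for f in range(folds):
        train = [i for i in range(len(X)) if i % folds != f]
        w = fit_logistic([X[i] for i in train], [y[i] for i in train], epochs, lr)
        for i in range(f, len(X), folds):
            probs[i] = predict_proba(w, X[i])
    return probs

def make_toy_data(n=40, length=21, seed=0):
    """Synthetic sequences; positives are G-rich (a toy rule, not real m7G biology)."""
    rng = random.Random(seed)
    seqs, labels = [], []
    for i in range(n):
        label = i % 2
        weights = [1, 1, 3, 1] if label else [1, 1, 1, 1]
        seqs.append("".join(rng.choices("ACGU", weights=weights, k=length)))
        labels.append(label)
    return seqs, labels

def three_layer_probabilities(seqs, labels):
    settings = [(50, 0.5), (100, 0.1)]  # hypothetical (epochs, learning rate) pairs
    # Layer 1: one baseline model per (encoding, setting) pair.
    encodings = [[kmer_freqs(s, k) for s in seqs] for k in (1, 2, 3)]
    layer1 = [out_of_fold(X, labels, e, r) for X in encodings for e, r in settings]
    X1 = [list(row) for row in zip(*layer1)]  # one probability vector per sequence
    # Layer 2: meta-models trained on the layer-1 probabilities.
    layer2 = [out_of_fold(X1, labels, e, r) for e, r in settings]
    X2 = [list(row) for row in zip(*layer2)]
    # Layer 3: a single final learner on the layer-2 probabilities
    # (fitted and scored on the same samples, for brevity).
    final = fit_logistic(X2, labels)
    return layer1, layer2, [predict_proba(final, x) for x in X2]

if __name__ == "__main__":
    seqs, labels = make_toy_data()
    layer1, layer2, final_probs = three_layer_probabilities(seqs, labels)
    print(len(layer1), "baseline models,", len(layer2), "meta-models")
    print("first five final probabilities:", [round(p, 3) for p in final_probs[:5]])
```

Passing out-of-fold probabilities from one layer to the next is a common way to keep upper layers from learning on predictions that the lower layers made for their own training samples; whether THRONE generates its intermediate features this way is not stated in the abstract.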