Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China.
Int J Biol Macromol. 2023 Jun 1;239:124247. doi: 10.1016/j.ijbiomac.2023.124247. Epub 2023 Mar 30.
2'-O-methylation (2OM) is a ubiquitous post-transcriptional modification in RNAs. It is important for the regulation of RNA stability, mRNA splicing and translation, as well as innate immunity. With the increase in publicly available 2OM data, several computational tools have been developed to identify 2OM sites in human RNA. However, these tools suffer from redundant features with low discriminative power, flawed dataset construction, or overfitting. To address these issues, we developed a two-step feature selection model to identify 2OM sites, based on data for four types of 2OM (2OM-adenine (A), cytosine (C), guanine (G), and uracil (U)). For each type, one-way analysis of variance (ANOVA) combined with mutual information (MI) was used to rank sequence features and obtain the optimal feature subset. Subsequently, four predictors based on eXtreme Gradient Boosting (XGBoost) or support vector machine (SVM) were built to identify the four types of 2OM sites. The proposed model achieved an overall accuracy of 84.3% on the independent set. For the convenience of users, an online tool called i2OM was built and is freely accessible at i2om.lin-group.cn. The predictor may provide a reference for the study of 2OM.
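The two-step feature ranking described above can be illustrated with a minimal sketch. The abstract does not specify how ANOVA and MI are combined, so everything here is an assumption made for illustration: the dinucleotide frequency encoding, the median binarization used to estimate MI, the filter-then-re-rank order, the cut-offs, the toy sequences, and all function names. It stops at feature selection and does not reproduce the XGBoost or SVM predictors.

```python
from collections import Counter
from itertools import product
from math import log


def kmer_features(seq, k=2):
    """Normalized k-mer frequencies over ACGU (an assumed, illustrative encoding)."""
    kmers = ["".join(p) for p in product("ACGU", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(len(seq) - k + 1, 1)
    return [counts[m] / total for m in kmers]


def anova_f(values, labels):
    """One-way ANOVA F statistic of a single feature across class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, g = len(values), len(groups)
    grand = sum(values) / n
    between = sum(len(vs) * (sum(vs) / len(vs) - grand) ** 2 for vs in groups.values())
    within = sum((v - sum(vs) / len(vs)) ** 2 for vs in groups.values() for v in vs)
    if within == 0:
        # No within-class variance: perfectly separating or constant feature.
        return float("inf") if between > 0 else 0.0
    return (between / (g - 1)) / (within / (n - g))


def mutual_information(values, labels):
    """MI in nats between a median-binarized feature and the class label."""
    median = sorted(values)[len(values) // 2]
    bins = [int(v >= median) for v in values]
    n = len(values)
    joint, px, py = Counter(zip(bins, labels)), Counter(bins), Counter(labels)
    return sum((c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in joint.items())


def two_step_select(X, y, keep_anova, keep_mi):
    """Step 1: keep the top features by F score. Step 2: re-rank survivors by MI."""
    cols = list(zip(*X))
    by_f = sorted(range(len(cols)), key=lambda j: anova_f(cols[j], y), reverse=True)
    survivors = by_f[:keep_anova]
    by_mi = sorted(survivors, key=lambda j: mutual_information(cols[j], y), reverse=True)
    return by_mi[:keep_mi]


# Toy data invented for this sketch: 1 = modified site, 0 = unmodified site.
sequences = ["GAGAGAGAGA", "GAGAAGAGAC", "AGAGAGAGAU",
             "CUCUCUCUCU", "UCUCUCUCUA", "CUCUUCUCUG"]
y = [1, 1, 1, 0, 0, 0]
X = [kmer_features(s) for s in sequences]
selected = two_step_select(X, y, keep_anova=8, keep_mi=3)
print("selected feature indices:", selected)
```

The selected column indices would then be used to subset the feature matrix before training a classifier for each nucleotide type.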