Institute of Computational Biology, College of Intelligence and Computing, Tianjin University, Tianjin, China.
Bioinformatics Laboratory, Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China.
RNA Biol. 2021 Nov;18(11):1882-1892. doi: 10.1080/15476286.2021.1875180. Epub 2021 Feb 12.
Recent studies have shown that RNA methylation modification can affect RNA transcription, metabolism, splicing and stability. In addition, RNA methylation modification has been associated with cancer, obesity and other diseases. Using human genome information and machine learning, this paper examines how fusing sequence-level and gene-level features affects the accuracy of methylation site recognition. The discovery of new features exposed two significant limitations of existing computational tools: (1) most prediction models are based solely on sequence features and use SVM or random forest as the classification method; (2) because the number of samples is limited, the models may not achieve good performance. To establish a better prediction model for methylation sites, we must set specific weighting strategies for the training samples and find more powerful and informative feature matrices on which to build a comprehensive model. In this paper, we present HSM6AP, a high-precision predictor for N6-methyladenosine (m6A) sites based on multiple weights and feature stitching. Unlike existing methods, HSM6AP weights its samples during training and explores a wide range of features. Max-Relevance-Max-Distance (MRMD) is employed for feature selection, and the feature matrix is generated by fusing the individual features. Extreme gradient boosting (XGBoost), an ensemble machine learning algorithm based on decision trees, is used for model training, and model performance is improved through parameter tuning. Evaluation on two rigorous independent data sets demonstrated the superiority of HSM6AP in identifying methylation sites. HSM6AP is an advanced predictor that users, especially non-specialists, can employ directly to predict methylation sites. The related tools and data sets are available on the authors' website, and the code of the tool is publicly accessible.
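The pipeline described in the abstract (feature fusion, relevance-and-distance feature ranking, weighted training samples) can be illustrated with a minimal sketch. This is not the HSM6AP implementation: the abstract does not list the actual features or weights, so the two encodings (one-hot and 2-mer frequency), the class-balancing weight formula, the simplified MRMD-style score, and every function name below are assumptions chosen for illustration only.

```python
import math
from itertools import product

BASES = "ACGU"  # assumed RNA alphabet


def one_hot(seq):
    """Binary encoding: four values per nucleotide."""
    return [1.0 if base == b else 0.0 for base in seq for b in BASES]


def kmer_freq(seq, k=2):
    """Normalized k-mer frequencies (4**k values)."""
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1
    for i in range(total):
        sub = seq[i:i + k]
        if sub in counts:  # skip windows with unknown characters
            counts[sub] += 1
    return [counts[m] / total for m in kmers]


def fuse(seq):
    """Feature stitching: concatenate the individual encodings."""
    return one_hot(seq) + kmer_freq(seq)


def pearson(x, y):
    """Pearson correlation; returns 0.0 for a constant vector."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def mrmd_style_rank(X, y):
    """Rank feature indices by relevance plus distance (simplified).

    relevance: |Pearson correlation| between a feature and the label
    distance:  mean Euclidean distance to every other feature column
    """
    cols = [list(c) for c in zip(*X)]
    scores = []
    for i, ci in enumerate(cols):
        relevance = abs(pearson(ci, y))
        dists = [math.dist(ci, cj) for j, cj in enumerate(cols) if j != i]
        distance = sum(dists) / len(dists) if dists else 0.0
        scores.append(relevance + distance)
    return sorted(range(len(cols)), key=lambda i: scores[i], reverse=True)


def balanced_weights(y):
    """Per-sample weights inversely proportional to class frequency."""
    n = len(y)
    counts = {c: y.count(c) for c in set(y)}
    return [n / (len(counts) * counts[c]) for c in y]
```

Under these assumptions, the fused matrix restricted to the top-ranked columns, together with the per-sample weights, could then be handed to a gradient boosting classifier, for example through the `sample_weight` argument of `xgboost.XGBClassifier.fit`.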