Jia Yanna, Zhang Zilong, Yan Shankai, Zhang Qingchen, Wei Leyi, Cui Feifei
School of Computer Science and Technology, Hainan University, Haikou 570228, China.
Centre for Artificial Intelligence driven Drug Discovery, Faculty of Applied Science, Macao Polytechnic University, Macao SAR, China; School of Informatics, Xiamen University, Xiamen, China.
Int J Biol Macromol. 2024 Dec;282(Pt 3):136940. doi: 10.1016/j.ijbiomac.2024.136940. Epub 2024 Oct 28.
RNA N4-acetylcytidine (ac4C) modification plays a crucial role in regulating gene expression. However, existing prediction methods are limited in how well they capture RNA sequence features, particularly when handling sequence complexity and long-range dependencies. To improve the accuracy of RNA ac4C modification site prediction, this study introduces, for the first time, the transformer-based pre-trained model RNAErnie to extract deep semantic information from RNA sequences. The features it produces are combined with those from six traditional feature extraction methods (such as one-hot and ENAC) to form a multidimensional feature set. On this basis, we propose the Voting-ac4C model, which uses a deep neural network for feature selection. The selected features are then fed into a soft voting ensemble learning model, which integrates the strengths of several machine learning algorithms to predict RNA ac4C modification sites. Experimental results demonstrate that, compared with state-of-the-art methods, Voting-ac4C achieves significant improvements across multiple metrics, including area under the ROC curve (AUC), sensitivity (SN), specificity (SP), accuracy (ACC), and Matthews correlation coefficient (MCC). This study provides a novel approach to RNA modification site prediction and highlights the potential of pre-trained models in biological sequence analysis.
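As a rough illustration of three ideas named in the abstract (one-hot encoding, soft voting, and the SN/SP/ACC/MCC metrics), the following minimal Python sketch may help. It is not the authors' implementation: the function names, the equal model weights, the 0.5 decision threshold, and the example probabilities are all assumptions made for illustration, and the RNAErnie features, the deep-neural-network feature selection, and the actual base classifiers are not reproduced here.

```python
from math import sqrt

NUCLEOTIDES = "ACGU"  # assumed alphabet; any other symbol encodes as all zeros


def one_hot(seq):
    """Flatten an RNA sequence into a one-hot vector (4 values per base)."""
    vec = []
    for base in seq.upper():
        vec.extend(1.0 if base == n else 0.0 for n in NUCLEOTIDES)
    return vec


def soft_vote(prob_lists, weights=None):
    """Soft voting: weighted average of each model's positive-class probability.

    prob_lists holds one list per model, each with one probability per sample.
    """
    if weights is None:
        weights = [1.0] * len(prob_lists)  # equal weights are an assumption
    total = sum(weights)
    n_samples = len(prob_lists[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, prob_lists)) / total
        for i in range(n_samples)
    ]


def binary_metrics(y_true, y_prob, threshold=0.5):
    """Compute SN, SP, ACC and MCC from labels and predicted probabilities."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "SN": tp / (tp + fn) if tp + fn else 0.0,   # sensitivity (recall)
        "SP": tn / (tn + fp) if tn + fp else 0.0,   # specificity
        "ACC": (tp + tn) / len(y_true) if y_true else 0.0,
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical positive-class probabilities from three stand-in classifiers.
    y_true = [1, 1, 0, 0]
    model_a = [0.9, 0.4, 0.2, 0.6]
    model_b = [0.8, 0.7, 0.3, 0.3]
    model_c = [0.7, 0.6, 0.1, 0.4]

    averaged = soft_vote([model_a, model_b, model_c])
    print(one_hot("AC"))
    print(averaged)
    print(binary_metrics(y_true, averaged))
```

In this toy input, model_a alone would misclassify the second and fourth samples at a 0.5 threshold, while the averaged probabilities fall on the correct side for all four; this is the kind of error compensation that soft voting is meant to provide.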