College of Engineering, Westlake University, Hangzhou, China; College of Sciences, Nanjing Agricultural University, Nanjing, China.
College of Sciences, Nanjing Agricultural University, Nanjing, China.
Genomics. 2024 Jan;116(1):110749. doi: 10.1016/j.ygeno.2023.110749. Epub 2023 Nov 25.
N4-acetylcytidine (ac4C) is a highly conserved RNA modification that plays a crucial role in various biological processes. Accurately identifying ac4C sites is essential for understanding their regulatory mechanisms. However, existing experimental techniques for ac4C site identification are costly, and current computational methods still leave room for improvement in prediction accuracy.
In this paper, we present MetaAc4C, a deep learning model that leverages pre-trained bidirectional encoder representations from transformers (BERT). The model is built on a bidirectional long short-term memory network (BLSTM) architecture with an attention mechanism and a residual connection. To address data imbalance, we adapt a generative adversarial network (WGAN-GP) to generate synthetic feature samples. On the independent test set, MetaAc4C surpasses the current state-of-the-art ac4C prediction model, improving ACC, MCC, and AUROC by 2.36%, 4.76%, and 3.11%, respectively, on the unbalanced dataset. On the balanced dataset, MetaAc4C improves ACC, MCC, and AUROC by 2.6%, 5.11%, and 1.01%, respectively. Notably, augmenting the training RNA samples with WGAN-GP outperforms the SMOTE oversampling method.
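The three reported metrics (ACC, MCC, AUROC) can be illustrated with a minimal sketch using their standard textbook definitions. The abstract does not state the decision threshold or the exact formulas used, so the 0.5 threshold, the function names, and the toy labels and scores below are assumptions for illustration only, not the authors' evaluation code.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = ac4C site)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """ACC: fraction of all samples classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """MCC: Matthews correlation coefficient; returns 0.0 if the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def auroc(y_true, scores):
    """AUROC as the probability that a random positive outscores a random negative
    (ties count as half). Quadratic in sample count; fine for a small illustration."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy, made-up data: 2 positives and 4 negatives (an unbalanced set).
y_true = [1, 1, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

counts = confusion_counts(y_true, y_pred)
acc_value = accuracy(*counts)
mcc_value = mcc(*counts)
auroc_value = auroc(y_true, scores)
print(counts, acc_value, mcc_value, auroc_value)
```

On this toy set the accuracy is noticeably higher than the MCC, which is one common reason for reporting MCC alongside ACC when positive and negative classes are unbalanced.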