Liu Ruijie, Zhang Yuanpeng, Wang Qi, Zhang Xiaoping
Department of Urology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430000, China.
Shenzhen Huazhong University of Science and Technology Research Institute, Shenzhen, 518000, China.
Brief Bioinform. 2024 Mar 27;25(3). doi: 10.1093/bib/bbae200.
N4-acetylcytidine (ac4C) is a ribonucleic acid (RNA) modification associated with disease. Expensive and labor-intensive experimental methods have hindered the exploration of ac4C mechanisms and the development of specific anti-ac4C drugs, so an advanced prediction model for ac4C in RNA is urgently needed. Although various prediction models have been constructed, they share several limitations: (1) insufficient base-level resolution for ac4C sites; (2) lack of information on species other than Homo sapiens; (3) lack of information on RNA types other than mRNA; and (4) lack of interpretation for each prediction. To address these limitations, we reconstructed the previous benchmark dataset and introduced a new dataset that includes balanced RNA sequences from multiple species and RNA types and provides base-level resolution for ac4C sites. We also propose a novel transformer-based architecture and pipeline for predicting ac4C sites, which delivers highly accurate predictions and visually interpretable results and places no restriction on the length of input RNA sequences. Our work improves the accuracy of predicting specific ac4C sites in multiple species from less than 40% to around 85%, with an AUC above 0.9. These results significantly surpass the performance of all existing models.
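As a rough illustration of what base-level resolution on inputs of arbitrary length can involve, the sketch below treats every cytidine in a sequence as a candidate site and extracts a fixed-width context window around it. This is an assumption made for illustration, not the pipeline described in the abstract: the function name `cytidine_windows`, the `flank` width and the `N` padding symbol are all hypothetical.

```python
from typing import List, Tuple


def cytidine_windows(seq: str, flank: int = 3, pad: str = "N") -> List[Tuple[int, str]]:
    """Return one (position, window) pair per cytidine in `seq`.

    Hypothetical helper: `flank` and `pad` are illustrative choices,
    not parameters taken from the paper.
    """
    # Normalise case and accept DNA-style input by mapping T to U.
    seq = seq.upper().replace("T", "U")
    # Pad both ends so windows near the sequence boundaries keep full width.
    padded = pad * flank + seq + pad * flank
    windows = []
    for i, base in enumerate(seq):
        if base == "C":
            # Base i sits at padded index i + flank, so the window
            # padded[i : i + 2*flank + 1] is centred on it.
            windows.append((i, padded[i:i + 2 * flank + 1]))
    return windows


# Every C becomes a candidate site, whatever the sequence length.
example = cytidine_windows("AUGCCGUAC", flank=2)
for position, window in example:
    print(position, window)
```

Each window could then be scored independently by a site-level classifier, which is one way a model can report predictions per base rather than per sequence.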