School of Artificial Intelligence, Hebei University of Technology, Tianjin 300400, China.
School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China.
J Chem Inf Model. 2024 Oct 28;64(20):8074-8081. doi: 10.1021/acs.jcim.4c01415. Epub 2024 Oct 5.
N4-acetylcytidine (ac4C) plays a crucial role in regulating cellular biological processes, particularly in gene expression regulation and disease development. However, wet-lab experiments to identify ac4C sites are time-consuming and costly, and existing learning-based methods struggle to capture the underlying semantic knowledge and relations within sequences. To address this, we propose NBCR-ac4C, a deep learning approach based on pretrained models. Specifically, we employ Nucleotide Transformer and DNABERT2 to construct contextual embeddings of nucleotide sequences, which effectively mine and express the contextual relations between different features in a sequence. A convolutional neural network (CNN) and ResNet18 are then applied to further extract shallow and deep knowledge from the contextual embeddings. Based on extensive experiments on the prediction of ac4C sites in nucleotide sequences, we observe that NBCR-ac4C outperforms general learning-based models. It achieves the highest accuracy (ACC) of 83.51% and an Area Under the Receiver Operating Characteristic Curve (AUROC) of 89.58% on an independent test set. Moreover, compared to the current state-of-the-art (SOTA) model LSA-ac4C, the proposed model improves ACC and AUROC by 0.81-3.7% and 0.05-1.58%, respectively. The data set and code are available at https://github.com/2103374200/NBCR to facilitate further discussion on NBCR-ac4C.
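For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how ACC and AUROC can be computed for a binary site predictor. It is not the authors' evaluation code: the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions chosen for illustration.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the true label.

    The 0.5 threshold is an assumption, not a value taken from the paper.
    """
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def auroc(labels, scores):
    """AUROC as the probability that a random positive example receives a
    higher score than a random negative example (ties count as 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data (1 = ac4C site, 0 = non-site); not from the paper.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]

acc_value = accuracy(labels, scores)
auroc_value = auroc(labels, scores)
```

The pairwise form of AUROC is quadratic in the number of examples, which is fine for a small illustration; a rank-based computation would be preferable on a full test set.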