4mC site recognition algorithm based on the pruned pre-trained DNABert-Pruning model and fused artificial feature encoding.

Affiliations

Guangdong University of Technology, Guangzhou, 510000, China.

Publication information

Anal Biochem. 2024 Jun;689:115492. doi: 10.1016/j.ab.2024.115492. Epub 2024 Mar 6.

Abstract

DNA 4mC plays a crucial role in gene expression in organisms. However, existing deep learning algorithms are limited in their ability to represent DNA sequence features. In this paper, we propose DNABert-4mC, a 4mC site identification algorithm that fuses the pruned pre-trained model DNABert-Pruning with artificial feature encoding. The algorithm prunes and compresses the DNABert model, yielding the pruned pre-trained model DNABert-Pruning. This model has fewer parameters and removes redundancy from the output features, giving more precise feature representations while maintaining accuracy. In parallel, the algorithm constructs an artificial feature encoding module to assist the DNABert-Pruning model in feature representation, supplementing information that is missing from the pre-trained features. The algorithm also introduces the AFF-4mC fusion strategy, which combines artificial feature encoding with the DNABert-Pruning model to improve the feature representation of DNA sequences in multi-semantic spaces and to better extract 4mC sites and the distribution of nucleotide importance within the sequence. In experiments on six independent test sets, DNABert-4mC achieved an average AUC of 93.81%, outperforming seven other advanced algorithms by 2.05%, 5.02%, 11.32%, 5.90%, 12.02%, 2.42% and 2.34%, respectively.
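The abstract names three ingredients (pruning, hand-crafted feature encoding, and fusion) without giving their details, so the following is only a minimal sketch of what each could look like in plain Python. Everything in it is an assumption made for illustration: one-hot and k-mer frequency encodings stand in for the unspecified artificial features, simple concatenation stands in for the AFF-4mC strategy, and magnitude pruning stands in for the unspecified pruning method. The function names (`one_hot`, `kmer_frequencies`, `fuse_by_concatenation`, `magnitude_prune`) are hypothetical and do not come from the paper.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def one_hot(seq):
    """Flattened one-hot encoding: 4 values per position, in A, C, G, T order.

    Assumption: a stand-in for the paper's unspecified artificial features.
    """
    return [1.0 if base == n else 0.0 for base in seq.upper() for n in NUCLEOTIDES]


def kmer_frequencies(seq, k=2):
    """Relative frequency of every overlapping k-mer, in lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = max(len(seq) - k + 1, 0)  # number of overlapping windows
    for i in range(total):
        window = seq[i:i + k]
        if window in counts:  # windows with ambiguous bases (e.g. N) are not counted
            counts[window] += 1
    return [counts[km] / total if total else 0.0 for km in kmers]


def fuse_by_concatenation(learned, handcrafted):
    """Join a learned embedding and hand-crafted features into one vector.

    Assumption: plain concatenation, not the paper's AFF-4mC strategy.
    """
    return list(learned) + list(handcrafted)


def magnitude_prune(weights, fraction):
    """Zero out the given fraction of weights with the smallest absolute value.

    Assumption: magnitude pruning is a common generic technique; the paper's
    actual pruning procedure for DNABert is not described in the abstract.
    """
    n_prune = int(len(weights) * fraction)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


# Usage: the "learned" vector is a made-up placeholder for a model embedding.
sequence = "ACGTAC"
handcrafted = one_hot(sequence) + kmer_frequencies(sequence, k=2)
fused = fuse_by_concatenation([0.12, -0.40, 0.88], handcrafted)
```

In a real pipeline the placeholder embedding would come from the pruned language model, and the fused vector would feed a classifier that scores each candidate site.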
