4mCBERT: A computing tool for the identification of DNA N4-methylcytosine sites by sequence- and chemical-derived information based on ensemble learning strategies.

Authors

Yang Sen, Yang Zexi, Yang Jun

Affiliations

School of Computer Science and Artificial Intelligence, Aliyun School of Big Data, School of Software, Changzhou 213164, China; The Affiliated Changzhou No 2 People's Hospital of Nanjing Medical University, Changzhou 213164, China.

School of Computer Science and Artificial Intelligence, Aliyun School of Big Data, School of Software, Changzhou 213164, China.

Publication details

Int J Biol Macromol. 2023 Mar 15;231:123180. doi: 10.1016/j.ijbiomac.2023.123180. Epub 2023 Jan 13.

DOI:10.1016/j.ijbiomac.2023.123180
PMID:36646347
Abstract

N4-methylcytosine (4mC) is a recently discovered DNA methylation modification that plays critical roles in gene expression regulation, defense against invading genetic elements, genomic imprinting, and other processes. Identifying 4mC sites from DNA sequence segments contributes to discovering novel modification patterns. In this paper, we present 4mCBERT, a model that encodes DNA sequence segments using sequence characteristics (one-hot encoding, electron-ion interaction pseudopotential, nucleotide chemical property, and word2vec) together with chemical information (physicochemical properties (PCP) and chemical bidirectional encoder representations from transformers (chemical BERT)), and that employs an ensemble learning framework to build the prediction model. PCP and chemical BERT features are constructed and applied to 4mC site prediction for the first time, and they contribute positively to identifying 4mC. In terms of the Matthews correlation coefficient, 4mCBERT significantly outperformed other state-of-the-art models on six independent benchmark datasets (A. thaliana, C. elegans, D. melanogaster, E. coli, G. pickeringii, and G. subterraneus) by 4.32 % to 24.39 %, 2.52 % to 31.65 %, 2 % to 16.49 %, 6.63 % to 35.15 %, 8.59 % to 61.85 %, and 8.45 % to 34.45 %, respectively. Moreover, 4mCBERT allows users both to predict 4mC sites and to retrain 4mC prediction models. In brief, 4mCBERT achieves higher performance on six benchmark datasets by incorporating sequence- and chemical-derived information, and it is available at http://cczubio.top/4mCBERT and https://github.com/abcair/4mCBERT.
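
Three of the sequence encodings named in the abstract (one-hot, electron-ion interaction pseudopotential, and nucleotide chemical property) are simple per-base lookups, and the Matthews correlation coefficient used for evaluation has a closed form. The sketch below illustrates them with the lookup values commonly used in the literature. It is not taken from the 4mCBERT code base: the function names `encode` and `mcc`, the feature ordering, and the choice to concatenate the three encodings are assumptions made for illustration only.

```python
import math

# Commonly used per-nucleotide lookup tables. Whether 4mCBERT uses exactly
# these values and this ordering is an assumption of this sketch.
ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}
# NCP triple: (ring structure, functional group, hydrogen-bond strength).
NCP = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "T": (0, 0, 1)}


def encode(seq):
    """Return a flat feature list with 8 values per base:
    one-hot (4), EIIP (1), NCP (3). Hypothetical helper."""
    features = []
    for base in seq.upper():
        if base not in ONE_HOT:
            raise ValueError(f"unsupported nucleotide: {base!r}")
        features.extend(ONE_HOT[base])
        features.append(EIIP[base])
        features.extend(NCP[base])
    return features


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1).
    Returns 0.0 when the denominator is zero."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    vec = encode("ACGT")
    print(len(vec))                          # 4 bases x 8 features
    print(mcc([1, 0, 1, 0], [1, 0, 1, 0]))   # perfect agreement
```

The word2vec, PCP, and chemical BERT features described in the abstract require trained models or property tables that the abstract does not specify, so they are left out of this sketch.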

Similar articles

1. 4mCBERT: A computing tool for the identification of DNA N4-methylcytosine sites by sequence- and chemical-derived information based on ensemble learning strategies.
   Int J Biol Macromol. 2023 Mar 15;231:123180. doi: 10.1016/j.ijbiomac.2023.123180. Epub 2023 Jan 13.
2. Accurate prediction of DNA N4-methylcytosine sites via boost-learning various types of sequence features.
   BMC Genomics. 2020 Sep 11;21(1):627. doi: 10.1186/s12864-020-07033-8.
3. Deep-4mCW2V: A sequence-based predictor to identify N4-methylcytosine sites in Escherichia coli.
   Methods. 2022 Jul;203:558-563. doi: 10.1016/j.ymeth.2021.07.011. Epub 2021 Aug 2.
4. A novel method for predicting DNA N4-methylcytosine sites based on deep forest algorithm.
   J Bioinform Comput Biol. 2023 Feb;21(1):2350003. doi: 10.1142/S0219720023500038. Epub 2023 Mar 9.
5. A Deep Neural Network for Identifying DNA N4-Methylcytosine Sites.
   Front Genet. 2020 Mar 6;11:209. doi: 10.3389/fgene.2020.00209. eCollection 2020.
6. Computational identification of N4-methylcytosine sites in the mouse genome with machine-learning method.
   Math Biosci Eng. 2021 Apr 15;18(4):3348-3363. doi: 10.3934/mbe.2021167.
7. 4mCPred: machine learning methods for DNA N4-methylcytosine sites prediction.
   Bioinformatics. 2019 Feb 15;35(4):593-601. doi: 10.1093/bioinformatics/bty668.
8. Exploring sequence-based features for the improved prediction of DNA N4-methylcytosine sites in multiple species.
   Bioinformatics. 2019 Apr 15;35(8):1326-1333. doi: 10.1093/bioinformatics/bty824.
9. DeepTorrent: a deep learning-based approach for predicting DNA N4-methylcytosine sites.
   Brief Bioinform. 2021 May 20;22(3). doi: 10.1093/bib/bbaa124.
10. DeepSF-4mC: A deep learning model for predicting DNA cytosine 4mC methylation sites leveraging sequence features.
    Comput Biol Med. 2024 Mar;171:108166. doi: 10.1016/j.compbiomed.2024.108166. Epub 2024 Feb 16.

Cited by

1. DNA sequence analysis landscape: a comprehensive review of DNA sequence analysis task types, databases, datasets, word embedding methods, and language models.
   Front Med (Lausanne). 2025 Apr 8;12:1503229. doi: 10.3389/fmed.2025.1503229. eCollection 2025.
2. Ensemble learning-based predictor for driver synonymous mutation with sequence representation.
   PLoS Comput Biol. 2025 Jan 6;21(1):e1012744. doi: 10.1371/journal.pcbi.1012744. eCollection 2025 Jan.
3. LncLSTA: a versatile predictor unveiling subcellular localization of lncRNAs through long-short term attention.
   Bioinform Adv. 2024 Nov 22;5(1):vbae173. doi: 10.1093/bioadv/vbae173. eCollection 2025.
4. RiceSNP-ABST: a deep learning approach to identify abiotic stress-associated single nucleotide polymorphisms in rice.
   Brief Bioinform. 2024 Nov 22;26(1). doi: 10.1093/bib/bbae702.
5. iResNetDM: An interpretable deep learning approach for four types of DNA methylation modification prediction.
   Comput Struct Biotechnol J. 2024 Nov 13;23:4214-4221. doi: 10.1016/j.csbj.2024.11.006. eCollection 2024 Dec.
6. RiceSNP-BST: a deep learning framework for predicting biotic stress-associated SNPs in rice.
   Brief Bioinform. 2024 Sep 23;25(6). doi: 10.1093/bib/bbae599.
7. Domain-knowledge enabled ensemble learning of 5-formylcytosine (f5C) modification sites.
   Comput Struct Biotechnol J. 2024 Aug 8;23:3175-3185. doi: 10.1016/j.csbj.2024.08.004. eCollection 2024 Dec.
8. DLC-ac4C: A Prediction Model for N4-acetylcytidine Sites in Human mRNA Based on DenseNet and Bidirectional LSTM Methods.
   Curr Genomics. 2023 Nov 22;24(3):171-186. doi: 10.2174/0113892029270191231013111911.
9. Computational Approaches: A New Frontier in Cancer Research.
   Comb Chem High Throughput Screen. 2024;27(13):1861-1876. doi: 10.2174/0113862073265604231106112203.
10. ACP-BC: A Model for Accurate Identification of Anticancer Peptides Based on Fusion Features of Bidirectional Long Short-Term Memory and Chemically Derived Information.
    Int J Mol Sci. 2023 Oct 22;24(20):15447. doi: 10.3390/ijms242015447.