4mCBERT: A computing tool for the identification of DNA N4-methylcytosine sites by sequence- and chemical-derived information based on ensemble learning strategies.

Authors

Yang Sen, Yang Zexi, Yang Jun

Affiliations

School of Computer Science and Artificial Intelligence, Aliyun School of Big Data, School of Software, Changzhou 213164, China; The Affiliated Changzhou No 2 People's Hospital of Nanjing Medical University, Changzhou 213164, China.

School of Computer Science and Artificial Intelligence, Aliyun School of Big Data, School of Software, Changzhou 213164, China.

Publication information

Int J Biol Macromol. 2023 Mar 15;231:123180. doi: 10.1016/j.ijbiomac.2023.123180. Epub 2023 Jan 13.

Abstract

N4-methylcytosine (4mC) is an important DNA chemical modification, a methylation mark discovered in recent years that plays critical roles in gene expression regulation, defense against invading genetic elements, genomic imprinting, and other processes. Identifying 4mC sites from DNA sequence segments contributes to discovering more novel modification patterns. In this paper, we present a model called 4mCBERT that encodes DNA sequence segments with two kinds of features: sequence characteristics (one-hot, electron-ion interaction pseudopotential, nucleotide chemical property, and word2vec) and chemical information (physicochemical properties (PCP) and chemical bidirectional encoder representations from transformers (chemical BERT)). It then employs an ensemble learning framework to build a prediction model. PCP and chemical BERT features are, for the first time, constructed and applied to 4mC site prediction, and they contribute positively to identifying 4mC. In terms of the Matthews correlation coefficient, 4mCBERT significantly outperformed other state-of-the-art models on six independent benchmark datasets (A. thaliana, C. elegans, D. melanogaster, E. coli, G. pickeringii, and G. subterraneus) by 4.32 % to 24.39 %, 2.52 % to 31.65 %, 2 % to 16.49 %, 6.63 % to 35.15 %, 8.59 % to 61.85 %, and 8.45 % to 34.45 %, respectively. Moreover, 4mCBERT is designed to allow users to predict 4mC sites and retrain 4mC prediction models. In brief, 4mCBERT shows higher performance on six benchmark datasets by incorporating sequence- and chemical-driven information and is available at http://cczubio.top/4mCBERT and https://github.com/abcair/4mCBERT.
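To make the sequence-derived encodings named in the abstract more concrete, the following is a minimal sketch of three of them: one-hot, electron-ion interaction pseudopotential (EIIP), and nucleotide chemical property (NCP). It is not taken from the 4mCBERT code base. The function name `encode_segment`, the per-base concatenation into an 8-value row, and the example segment are assumptions made for illustration. The EIIP and NCP values are the ones commonly used in the DNA-modification prediction literature. The word2vec, PCP, and chemical BERT features, as well as the ensemble step, are not shown.

```python
# Illustrative sketch only: per-nucleotide one-hot, EIIP and NCP encodings.
# The layout (one 8-value row per base) is an assumption, not the paper's.

# One-hot: one indicator position per nucleotide.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}

# EIIP: commonly cited electron-ion interaction pseudopotential values.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}

# NCP: (ring structure, functional group, hydrogen bonding)
#   ring structure   1 = purine (A, G), 0 = pyrimidine (C, T)
#   functional group 1 = amino  (A, C), 0 = keto       (G, T)
#   hydrogen bonding 1 = weak   (A, T), 0 = strong     (C, G)
NCP = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "T": (0, 0, 1),
}


def encode_segment(seq):
    """Return one row per base: 4 one-hot + 1 EIIP + 3 NCP values."""
    rows = []
    for base in seq.upper():
        if base not in ONE_HOT:
            raise ValueError(f"unsupported nucleotide: {base!r}")
        row = [float(v) for v in ONE_HOT[base]]
        row.append(EIIP[base])
        row.extend(float(v) for v in NCP[base])
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Hypothetical short segment; real inputs would be fixed-length windows
    # centred on a candidate cytosine.
    for base, row in zip("ACGT", encode_segment("ACGT")):
        print(base, row)
```

Each base is mapped independently, so the resulting matrix keeps positional information and can be flattened or passed to a sequence model. How 4mCBERT actually combines these features with its other encodings is described in the paper and repository, not here.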

