BERT-m7G: A Transformer Architecture Based on BERT and Stacking Ensemble to Identify RNA N7-Methylguanosine Sites from Sequence Information.

Affiliations

College of Information Engineering, Shanghai Maritime University, 1550 Haigang Ave., Shanghai 201306, China.

School of Computer Science and Engineering, Southeast University, Nanjing 214135, China.

Publication information

Comput Math Methods Med. 2021 Aug 25;2021:7764764. doi: 10.1155/2021/7764764. eCollection 2021.

DOI:10.1155/2021/7764764

PMID:34484416

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8413034/

Abstract

As one of the most prevalent posttranscriptional modifications of RNA, N7-methylguanosine (m7G) plays an essential role in the regulation of gene expression. Accurate identification of m7G sites in the transcriptome is invaluable for revealing their potential functional mechanisms. Although high-throughput experimental methods can locate m7G sites precisely, they are expensive and time-consuming. Hence, it is imperative to design an efficient computational method that can accurately identify m7G sites. In this study, we propose a novel method that brings a BERT-based multilingual model into bioinformatics to represent the information in RNA sequences. First, we treat RNA sequences as natural-language sentences and employ the bidirectional encoder representations from transformers (BERT) model to transform them into fixed-length numerical matrices. Second, a feature selection scheme based on the elastic net is constructed to eliminate redundant features and retain important ones. Finally, the selected feature subset is fed into a stacking ensemble classifier to predict m7G sites, and the hyperparameters of the classifier are tuned with the tree-structured Parzen estimator (TPE) approach. Under 10-fold cross-validation, BERT-m7G achieves an accuracy (ACC) of 95.48% and a Matthews correlation coefficient (MCC) of 0.9100. The experimental results indicate that the proposed method significantly outperforms state-of-the-art prediction methods in identifying m7G modifications.
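As a rough illustration of two ideas in the abstract, the sketch below shows one way an RNA sequence could be turned into a "sentence" for a language model, and how ACC and MCC are computed from a binary confusion matrix. The overlapping k-mer word scheme, the value k = 3, the function names, and the confusion-matrix counts are all assumptions made for this example; the abstract does not specify the tokenization, and the counts are not the paper's results.

```python
import math


def rna_to_sentence(seq: str, k: int = 3) -> str:
    """Split an RNA sequence into overlapping k-mers joined by spaces.

    Assumption: k-mer "words" with k=3 are used only for illustration.
    """
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """ACC = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when the denominator is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    print(rna_to_sentence("AUGGC"))
    # Hypothetical counts, chosen only to exercise the formulas.
    print(accuracy(tp=90, tn=95, fp=5, fn=10))
    print(round(mcc(tp=90, tn=95, fp=5, fn=10), 4))
```

Because MCC uses all four cells of the confusion matrix, it is generally harder to inflate than ACC, which may be why the two are reported together.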
