BERT-m7G: A Transformer Architecture Based on BERT and Stacking Ensemble to Identify RNA N7-Methylguanosine Sites from Sequence Information.

Affiliations

College of Information Engineering, Shanghai Maritime University, 1550 Haigang Ave., Shanghai 201306, China.

School of Computer Science and Engineering, Southeast University, Nanjing 214135, China.

Publication information

Comput Math Methods Med. 2021 Aug 25;2021:7764764. doi: 10.1155/2021/7764764. eCollection 2021.

Abstract

As one of the most prevalent posttranscriptional modifications of RNA, N7-methylguanosine (m7G) plays an essential role in the regulation of gene expression. Accurate identification of m7G sites in the transcriptome is invaluable for revealing their potential functional mechanisms. Although high-throughput experimental methods can locate m7G sites precisely, they are expensive and time-consuming. Hence, it is imperative to design an efficient computational method that can accurately identify m7G sites. In this study, we propose a novel method that incorporates a BERT-based multilingual model into bioinformatics to represent the information in RNA sequences. First, we treat RNA sequences as natural sentences and employ the bidirectional encoder representations from transformers (BERT) model to transform them into fixed-length numerical matrices. Second, a feature selection scheme based on the elastic net method is constructed to eliminate redundant features and retain important ones. Finally, the selected feature subset is input into a stacking ensemble classifier to predict m7G sites, and the hyperparameters of the classifier are tuned with the tree-structured Parzen estimator (TPE) approach. Under 10-fold cross-validation, BERT-m7G achieves an accuracy (ACC) of 95.48% and a Matthews correlation coefficient (MCC) of 0.9100. The experimental results indicate that the proposed method significantly outperforms state-of-the-art prediction methods in the identification of m7G modifications.
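Two ideas in the abstract can be illustrated with a minimal Python sketch: writing an RNA sequence as a "sentence" of tokens, and computing the two reported metrics (ACC and MCC) from binary predictions. The abstract does not say how sequences are tokenized, so the overlapping k-mer scheme and the function names `kmer_sentence`, `confusion_counts`, `accuracy`, and `mcc` are assumptions made for illustration and are not the authors' implementation. The metric formulas are the standard definitions for binary classification.

```python
import math


def kmer_sentence(seq, k=3):
    """Write an RNA sequence as space-separated overlapping k-mers.

    Assumption: k-mer "words" are one plausible tokenization; the
    abstract does not specify the scheme actually used.
    """
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))


def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels, with 1 = m7G site."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """ACC = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0.0 if the denominator is 0."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Toy inputs, not data from the study.
    print(kmer_sentence("AUGGC"))
    counts = confusion_counts([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
    print(accuracy(*counts), mcc(*counts))
```

In a cross-validated setting, these metrics would typically be computed on each held-out fold and then averaged. The abstract does not state how its 10-fold figures were aggregated.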

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b08c/8413034/8b91c7efa8fb/CMMM2021-7764764.001.jpg