iEnhancer-ELM: improve enhancer identification by extracting position-related multiscale contextual information based on enhancer language models.

Author information

Li Jiahao, Wu Zhourun, Lin Wenhao, Luo Jiawei, Zhang Jun, Chen Qingcai, Chen Junjie

Affiliations

School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China.

Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China.

Publication information

Bioinform Adv. 2023 Mar 25;3(1):vbad043. doi: 10.1093/bioadv/vbad043. eCollection 2023.

Abstract

MOTIVATION

Enhancers are important cis-regulatory elements that regulate a wide range of biological functions and enhance the transcription of target genes. Although many feature extraction methods have been proposed to improve the performance of enhancer identification, they cannot learn position-related multiscale contextual information from raw DNA sequences.

RESULTS

In this article, we propose a novel enhancer identification method (iEnhancer-ELM) based on BERT-like enhancer language models. iEnhancer-ELM tokenizes DNA sequences with multi-scale k-mers and extracts position-related contextual information for k-mers of different scales via a multi-head attention mechanism. We first evaluate the performance of k-mers at each scale, then ensemble them to improve the performance of enhancer identification. The experimental results on two popular benchmark datasets show that our model outperforms state-of-the-art methods. We further illustrate the interpretability of iEnhancer-ELM. In a case study, we discover 30 enhancer motifs via a 3-mer-based model, of which 12 are verified by STREME and JASPAR, suggesting that our model has the potential to reveal the biological mechanisms of enhancers.
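
As an illustration of the multi-scale k-mer tokenization step described above, the following is a minimal sketch and not the authors' implementation. It assumes overlapping k-mers with a stride of 1, as is common in BERT-style DNA language models; the function names `kmer_tokenize` and `multiscale_tokenize` and the default scales are hypothetical choices made for this example.

```python
def kmer_tokenize(seq: str, k: int) -> list[str]:
    """Split a DNA sequence into overlapping k-mers (stride 1, an assumption)."""
    seq = seq.upper()
    if k < 1 or k > len(seq):
        raise ValueError("k must be between 1 and the sequence length")
    # A sequence of length n yields n - k + 1 overlapping tokens.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def multiscale_tokenize(seq: str, scales=(3, 4, 5, 6)) -> dict[int, list[str]]:
    """Tokenize the same sequence once per scale k (the scales are illustrative)."""
    return {k: kmer_tokenize(seq, k) for k in scales}


if __name__ == "__main__":
    for k, tokens in multiscale_tokenize("ACGTAC", scales=(3, 4)).items():
        print(k, tokens)
```

Each token keeps its position in the list, so a downstream model with positional embeddings and attention can relate a k-mer to its neighbours; one model per scale could then be trained and their predictions combined.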

AVAILABILITY AND IMPLEMENTATION

The models and associated code are available at https://github.com/chen-bioinfo/iEnhancer-ELM.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics Advances online.
