iEnhancer-ELM: improve enhancer identification by extracting position-related multiscale contextual information based on enhancer language models.

Authors

Li Jiahao, Wu Zhourun, Lin Wenhao, Luo Jiawei, Zhang Jun, Chen Qingcai, Chen Junjie

Affiliations

School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China.

Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China.

Publication information

Bioinform Adv. 2023 Mar 25;3(1):vbad043. doi: 10.1093/bioadv/vbad043. eCollection 2023.

DOI:10.1093/bioadv/vbad043

PMID:37113248

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10125906/

Abstract

MOTIVATION

Enhancers are important cis-regulatory elements that regulate a wide range of biological functions and enhance the transcription of target genes. Although many feature extraction methods have been proposed to improve the performance of enhancer identification, they cannot learn position-related multiscale contextual information from raw DNA sequences.

RESULTS

In this article, we propose a novel enhancer identification method (iEnhancer-ELM) based on BERT-like enhancer language models. iEnhancer-ELM tokenizes DNA sequences with multi-scale k-mers and extracts contextual information of different scale k-mers related to their positions via a multi-head attention mechanism. We first evaluate the performance of different scale k-mers, then ensemble them to improve the performance of enhancer identification. The experimental results on two popular benchmark datasets show that our model outperforms state-of-the-art methods. We further illustrate the interpretability of iEnhancer-ELM. In a case study, we discover 30 enhancer motifs via a 3-mer-based model, of which 12 are verified by STREME and JASPAR, demonstrating that our model has the potential to unveil the biological mechanism of enhancers.
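As a rough illustration of the multi-scale k-mer tokenization and ensembling described above, the sketch below splits a DNA sequence into overlapping k-mers at several scales and combines per-scale predictions by averaging. It is not the authors' implementation: the function names, the stride of 1, the scale set (3, 4, 5, 6), and the averaging rule with a 0.5 threshold are all assumptions made for the example. The abstract itself only names a 3-mer model and does not say how the scales are ensembled.

```python
from typing import Dict, List, Sequence


def kmer_tokenize(seq: str, k: int, stride: int = 1) -> List[str]:
    """Split a DNA sequence into k-mers taken every `stride` bases.

    A stride of 1 (overlapping k-mers) is an assumption of this sketch.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = seq.upper()
    # A sequence shorter than k yields an empty token list.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def multiscale_tokens(
    seq: str, scales: Sequence[int] = (3, 4, 5, 6)
) -> Dict[int, List[str]]:
    """Tokenize the same sequence once per scale k (scale set is hypothetical)."""
    return {k: kmer_tokenize(seq, k) for k in scales}


def average_ensemble(probs: Sequence[float], threshold: float = 0.5) -> bool:
    """Combine per-scale enhancer probabilities by simple averaging.

    Averaging is a stand-in here; the abstract does not state the ensemble rule.
    """
    if not probs:
        raise ValueError("at least one probability is required")
    return sum(probs) / len(probs) >= threshold


if __name__ == "__main__":
    tokens = multiscale_tokens("ACGTAC", scales=(3, 4))
    print(tokens[3])  # ['ACG', 'CGT', 'GTA', 'TAC']
    print(tokens[4])  # ['ACGT', 'CGTA', 'GTAC']
    # Hypothetical probabilities from three per-scale models.
    print(average_ensemble([0.9, 0.4, 0.7]))  # True
```

In a BERT-like setup, each token list would then be mapped to vocabulary IDs and passed, together with position embeddings, to the attention layers; that part is omitted here because it depends on the specific pretrained model.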

AVAILABILITY AND IMPLEMENTATION

The models and associated code are available at https://github.com/chen-bioinfo/iEnhancer-ELM.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics Advances online.

Figure image: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d803/10125906/ba4cb2b61c49/vbad043f1.jpg

Similar articles

iEnhancer-SKNN: a stacking ensemble learning-based method for enhancer identification and classification using sequence information.

Brief Funct Genomics. 2023 May 18;22(3):302-311. doi: 10.1093/bfgp/elac057.

iEnhancer-XG: interpretable sequence-based enhancers and their strength predictor.

Bioinformatics. 2021 May 23;37(8):1060-1067. doi: 10.1093/bioinformatics/btaa914.

iEnhancer-ECNN: identifying enhancers and their strength using ensembles of convolutional neural networks.

BMC Genomics. 2019 Dec 24;20(Suppl 9):951. doi: 10.1186/s12864-019-6336-3.

iEnhancer-DLRA: identification of enhancers and their strengths by a self-attention fusion strategy for local and global features.

Brief Funct Genomics. 2022 Sep 16;21(5):399-407. doi: 10.1093/bfgp/elac023.

iEnhancer-RD: Identification of enhancers and their strength using RKPK features and deep neural networks.

Anal Biochem. 2021 Oct 1;630:114318. doi: 10.1016/j.ab.2021.114318. Epub 2021 Aug 5.

iEnhancer-DCSA: identifying enhancers via dual-scale convolution and spatial attention.

BMC Genomics. 2023 Jul 13;24(1):393. doi: 10.1186/s12864-023-09468-1.

iEnhancer-KL: A Novel Two-Layer Predictor for Identifying Enhancers by Position Specific of Nucleotide Composition.

IEEE/ACM Trans Comput Biol Bioinform. 2021 Nov-Dec;18(6):2809-2815. doi: 10.1109/TCBB.2021.3053608. Epub 2021 Dec 8.

iEnhancer-MFGBDT: Identifying enhancers and their strength by fusing multiple features and gradient boosting decision tree.

Math Biosci Eng. 2021 Oct 14;18(6):8797-8814. doi: 10.3934/mbe.2021434.

iEnhancer-2L: a two-layer predictor for identifying enhancers and their strength by pseudo k-tuple nucleotide composition.

Bioinformatics. 2016 Feb 1;32(3):362-9. doi: 10.1093/bioinformatics/btv604. Epub 2015 Oct 17.

Cited by

DNA sequence analysis landscape: a comprehensive review of DNA sequence analysis task types, databases, datasets, word embedding methods, and language models.

Front Med (Lausanne). 2025 Apr 8;12:1503229. doi: 10.3389/fmed.2025.1503229. eCollection 2025.

A review on the applications of Transformer-based language models for nucleotide sequence analysis.

Comput Struct Biotechnol J. 2025 Mar 18;27:1244-1254. doi: 10.1016/j.csbj.2025.03.024. eCollection 2025.

Beyond digital twins: the role of foundation models in enhancing the interpretability of multiomics modalities in precision medicine.

FEBS Open Bio. 2025 Aug;15(8):1192-1208. doi: 10.1002/2211-5463.70003. Epub 2025 Feb 24.

Directed evolution of antimicrobial peptides using multi-objective zeroth-order optimization.

Brief Bioinform. 2024 Nov 22;26(1). doi: 10.1093/bib/bbae715.

CapsEnhancer: An Effective Computational Framework for Identifying Enhancers Based on Chaos Game Representation and Capsule Network.

J Chem Inf Model. 2024 Jul 22;64(14):5725-5736. doi: 10.1021/acs.jcim.4c00546. Epub 2024 Jun 30.

A deep learning model for DNA enhancer prediction based on nucleotide position aware feature encoding.

iScience. 2024 May 19;27(6):110030. doi: 10.1016/j.isci.2024.110030. eCollection 2024 Jun 21.

Predmoter-cross-species prediction of plant promoter and enhancer regions.

Bioinform Adv. 2024 May 24;4(1):vbae074. doi: 10.1093/bioadv/vbae074. eCollection 2024.

ADH-Enhancer: an attention-based deep hybrid framework for enhancer identification and strength prediction.

Brief Bioinform. 2024 Jan 22;25(2). doi: 10.1093/bib/bbae030.

References

CFAGO: cross-fusion of network and attributes based on attention mechanism for protein function prediction.

Bioinformatics. 2023 Mar 1;39(3). doi: 10.1093/bioinformatics/btad123.

sAMPpred-GAT: prediction of antimicrobial peptide by graph attention network and predicted peptide structure.

Bioinformatics. 2023 Jan 1;39(1). doi: 10.1093/bioinformatics/btac715.

iDNA-ABF: multi-scale deep biological language learning model for the interpretable prediction of DNA methylations.

Genome Biol. 2022 Oct 17;23(1):219. doi: 10.1186/s13059-022-02780-1.

Improving language model of human genome for DNA-protein binding prediction based on task-specific pre-training.

Interdiscip Sci. 2023 Mar;15(1):32-43. doi: 10.1007/s12539-022-00537-9. Epub 2022 Sep 22.

TPpred-ATMV: therapeutic peptide prediction by adaptive multi-view tensor learning model.

Bioinformatics. 2022 May 13;38(10):2712-2718. doi: 10.1093/bioinformatics/btac200.

DeepIDP-2L: protein intrinsically disordered region prediction by combining convolutional attention network and hierarchical attention network.

Bioinformatics. 2022 Feb 7;38(5):1252-1260. doi: 10.1093/bioinformatics/btab810.

JASPAR 2022: the 9th release of the open-access database of transcription factor binding profiles.

Nucleic Acids Res. 2022 Jan 7;50(D1):D165-D173. doi: 10.1093/nar/gkab1113.

BioSeq-BLM: a platform for analyzing DNA, RNA and protein sequences based on biological language models.

Nucleic Acids Res. 2021 Dec 16;49(22):e129. doi: 10.1093/nar/gkab829.

Integrative machine learning framework for the identification of cell-specific enhancers from the human genome.

Brief Bioinform. 2021 Nov 5;22(6). doi: 10.1093/bib/bbab252.

iEnhancer-GAN: A Deep Learning Framework in Combination with Word Embedding and Sequence Generative Adversarial Net to Identify Enhancers and Their Strength.

Int J Mol Sci. 2021 Mar 30;22(7):3589. doi: 10.3390/ijms22073589.
