A Novel Position-Specific Encoding Algorithm (SeqPose) of Nucleotide Sequences and Its Application for Detecting Enhancers.

Affiliations

Health Informatics Lab, College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, Jilin University, Changchun 130012, China.

School of Mathematics, Jilin University, Changchun 130012, China.

Publication information

Int J Mol Sci. 2021 Mar 17;22(6):3079. doi: 10.3390/ijms22063079.

DOI:10.3390/ijms22063079

PMID:33802922

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8002641/

Abstract

Enhancers are short genomic regions that exert tissue-specific regulatory roles, usually on remote coding regions. Enhancers are observed in both prokaryotic and eukaryotic genomes, and their detection facilitates a better understanding of the transcriptional regulation mechanism. Accurately detecting enhancers and evaluating their transcriptional regulation strength remain a major bioinformatics challenge. Most current studies use the statistical features of short fixed-length nucleotide sequences. This study introduces the location information of each k-mer (SeqPose) into the encoding strategy of a DNA sequence and employs the attention mechanism in a two-layer bi-directional long short-term memory (BD-LSTM) model (spEnhancer) for the enhancer detection problem. The first layer of the classifier discriminates between enhancers and non-enhancers, and the second layer evaluates the transcriptional regulation strength of each detected enhancer. The SeqPose-encoded features are selected by the Chi-squared test, and 45 positions are removed from further analysis. Existing studies may focus on selecting the statistical DNA sequence descriptors that contribute most to the prediction models; this study does not use these descriptors. Instead, the word vectors of the SeqPose-encoded features are obtained through a word embedding layer. This study hypothesizes that different word vector features may contribute differently to the enhancer detection model, and assigns different weights to these word vectors through the attention mechanism in the BD-LSTM model. A previous study generously provided the training and independent test datasets, and spEnhancer is compared with three existing state-of-the-art studies using the same experimental procedure. The leave-one-out validation data on the training dataset show that spEnhancer achieves detection performance similar to that of the three existing studies, while on the independent test dataset it achieves the best overall performance metric, MCC, for both binary classification problems. The experimental data show that the strategy of removing redundant positions (SeqPose) may help improve DNA sequence-based prediction models. spEnhancer may serve well as a complementary model to the existing studies, especially for novel query enhancers that are not included in the training dataset.
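The abstract does not spell out how SeqPose attaches location information to each k-mer, so the following is only a minimal sketch of one plausible position-specific encoding, not the authors' implementation. It assumes that each token is computed as `position * 4**k + kmer_index`, so the same k-mer receives a different integer at each position, and that a set of positions to discard (for example, those rejected by a Chi-squared test, which is not shown here) is supplied by the caller. The names `kmer_index`, `position_specific_tokens`, and `dropped_positions` are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_index(k):
    """Map every k-mer over ACGT to an integer in [0, 4**k)."""
    return {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}


def position_specific_tokens(seq, k=3, dropped_positions=frozenset()):
    """Return one integer token per retained k-mer start position.

    Assumed scheme (not taken from the paper):
        token = position * 4**k + index of the k-mer
    Positions listed in dropped_positions are skipped entirely.
    """
    index = kmer_index(k)
    vocab = 4 ** k
    tokens = []
    for pos in range(len(seq) - k + 1):
        if pos in dropped_positions:
            continue  # position removed by an upstream feature-selection step
        kmer = seq[pos:pos + k].upper()
        if kmer not in index:
            # Ambiguous bases such as N are not handled in this sketch.
            raise ValueError(f"unsupported k-mer {kmer!r} at position {pos}")
        tokens.append(pos * vocab + index[kmer])
    return tokens


if __name__ == "__main__":
    # 2-mers of "ACGTA" are AC, CG, GT, TA at positions 0..3.
    print(position_specific_tokens("ACGTA", k=2))
    # Same sequence with position 1 removed.
    print(position_specific_tokens("ACGTA", k=2, dropped_positions={1}))
```

Integer tokens of this kind could then be fed to a word embedding layer, which is the role the abstract assigns to the SeqPose-encoded features.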

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/60f1/8002641/c2c680885805/ijms-22-03079-g001.jpg

Similar articles

iEnhancer-GAN: A Deep Learning Framework in Combination with Word Embedding and Sequence Generative Adversarial Net to Identify Enhancers and Their Strength.

Int J Mol Sci. 2021 Mar 30;22(7):3589. doi: 10.3390/ijms22073589.

iEnhancer-SKNN: a stacking ensemble learning-based method for enhancer identification and classification using sequence information.

Brief Funct Genomics. 2023 May 18;22(3):302-311. doi: 10.1093/bfgp/elac057.

A deep learning framework for enhancer prediction using word embedding and sequence generation.

Biophys Chem. 2022 Jul;286:106822. doi: 10.1016/j.bpc.2022.106822. Epub 2022 May 5.

iEnhancer-KL: A Novel Two-Layer Predictor for Identifying Enhancers by Position Specific of Nucleotide Composition.

IEEE/ACM Trans Comput Biol Bioinform. 2021 Nov-Dec;18(6):2809-2815. doi: 10.1109/TCBB.2021.3053608. Epub 2021 Dec 8.

EnhancerPred2.0: predicting enhancers and their strength based on position-specific trinucleotide propensity and electron-ion interaction potential feature selection.

Mol Biosyst. 2017 Mar 28;13(4):767-774. doi: 10.1039/c7mb00054e.

Integrative machine learning framework for the identification of cell-specific enhancers from the human genome.

Brief Bioinform. 2021 Nov 5;22(6). doi: 10.1093/bib/bbab252.

Sequence based predictor for discrimination of enhancer and their types by applying general form of Chou's trinucleotide composition.

Comput Methods Programs Biomed. 2017 Jul;146:69-75. doi: 10.1016/j.cmpb.2017.05.008. Epub 2017 May 26.

iEnhancer-DCSA: identifying enhancers via dual-scale convolution and spatial attention.

BMC Genomics. 2023 Jul 13;24(1):393. doi: 10.1186/s12864-023-09468-1.

iEnhancer-5Step: Identifying enhancers using hidden information of DNA sequences via Chou's 5-step rule and word embedding.

Anal Biochem. 2019 Apr 15;571:53-61. doi: 10.1016/j.ab.2019.02.017. Epub 2019 Feb 26.

Cited by

Genome language modeling (GLM): a beginner's cheat sheet.

Biol Methods Protoc. 2025 Mar 25;10(1):bpaf022. doi: 10.1093/biomethods/bpaf022. eCollection 2025.

DeepEnhancerPPO: An Interpretable Deep Learning Approach for Enhancer Classification.

Int J Mol Sci. 2024 Dec 2;25(23):12942. doi: 10.3390/ijms252312942.

CapsEnhancer: An Effective Computational Framework for Identifying Enhancers Based on Chaos Game Representation and Capsule Network.

J Chem Inf Model. 2024 Jul 22;64(14):5725-5736. doi: 10.1021/acs.jcim.4c00546. Epub 2024 Jun 30.

iEnhancer-DCSA: identifying enhancers via dual-scale convolution and spatial attention.

BMC Genomics. 2023 Jul 13;24(1):393. doi: 10.1186/s12864-023-09468-1.

DeepITEH: a deep learning framework for identifying tissue-specific eRNAs from the human genome.

Bioinformatics. 2023 Jun 1;39(6). doi: 10.1093/bioinformatics/btad375.

Genomic benchmarks: a collection of datasets for genomic sequence classification.

BMC Genom Data. 2023 May 1;24(1):25. doi: 10.1186/s12863-023-01123-8.

Enhancer-LSTMAtt: A Bi-LSTM and Attention-Based Deep Learning Method for Enhancer Recognition.

Biomolecules. 2022 Jul 17;12(7):995. doi: 10.3390/biom12070995.

References

iEnhancer-ECNN: identifying enhancers and their strength using ensembles of convolutional neural networks.

BMC Genomics. 2019 Dec 24;20(Suppl 9):951. doi: 10.1186/s12864-019-6336-3.

Feature selection may improve deep neural networks for the bioinformatics problems.

Bioinformatics. 2020 Mar 1;36(5):1542-1552. doi: 10.1093/bioinformatics/btz763.

Genomic encoding of transcriptional burst kinetics.

Nature. 2019 Jan;565(7738):251-254. doi: 10.1038/s41586-018-0836-1. Epub 2019 Jan 2.

iEnhancer-EL: identifying enhancers and their strength with ensemble learning approach.

Bioinformatics. 2018 Nov 15;34(22):3835-3842. doi: 10.1093/bioinformatics/bty458.

A new method for enhancer prediction based on deep belief network.

BMC Bioinformatics. 2017 Oct 16;18(Suppl 12):418. doi: 10.1186/s12859-017-1828-0.

EnhancerPred: a predictor for discovering enhancers based on the combination and selection of multiple features.

Sci Rep. 2016 Dec 12;6:38741. doi: 10.1038/srep38741.

iEnhancer-2L: a two-layer predictor for identifying enhancers and their strength by pseudo k-tuple nucleotide composition.

Bioinformatics. 2016 Feb 1;32(3):362-9. doi: 10.1093/bioinformatics/btv604. Epub 2015 Oct 17.

Integrating diverse datasets improves developmental enhancer prediction.

PLoS Comput Biol. 2014 Jun 26;10(6):e1003677. doi: 10.1371/journal.pcbi.1003677. eCollection 2014 Jun.

RFECS: a random-forest based algorithm for enhancer identification from chromatin state.

PLoS Comput Biol. 2013;9(3):e1002968. doi: 10.1371/journal.pcbi.1002968. Epub 2013 Mar 14.

The role of long non-coding RNA in transcriptional gene silencing.

Curr Opin Plant Biol. 2012 Nov;15(5):517-22. doi: 10.1016/j.pbi.2012.08.008. Epub 2012 Sep 6.
