Health Informatics Lab, College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, Jilin University, Changchun 130012, China.
School of Mathematics, Jilin University, Changchun 130012, China.
Int J Mol Sci. 2021 Mar 17;22(6):3079. doi: 10.3390/ijms22063079.
Enhancers are short genomic regions that exert tissue-specific regulatory roles, usually on remote coding regions. Enhancers are observed in both prokaryotic and eukaryotic genomes, and their detection facilitates a better understanding of transcriptional regulation mechanisms. Accurately detecting enhancers and evaluating their transcriptional regulation strength remains a major bioinformatics challenge. Most current studies use statistical features of short fixed-length nucleotide sequences. This study introduces the location information of each k-mer (SeqPose) into the encoding strategy of a DNA sequence and employs the attention mechanism in a two-layer bi-directional long short-term memory (BD-LSTM) model (spEnhancer) for the enhancer detection problem. The first layer of the proposed classifier discriminates between enhancers and non-enhancers, and the second layer evaluates the transcriptional regulation strength of each detected enhancer. The SeqPose-encoded features are selected by the Chi-squared test, and 45 positions are removed from further analysis. Existing studies may focus on selecting statistical DNA sequence descriptors with large contributions to the prediction models; this study does not use such descriptors. Instead, word vectors of the SeqPose-encoded features are obtained through a word embedding layer. This study hypothesizes that different word vector features may contribute differently to the enhancer detection model, and assigns different weights to these word vectors through the attention mechanism in the BD-LSTM model. A previous study generously provided the training and independent test datasets, and the proposed spEnhancer is compared with three existing state-of-the-art studies using the same experimental procedure.
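The abstract does not spell out how the position of each k-mer enters the encoding. As a rough illustration only, the sketch below pairs every sliding-window k-mer with its start index and maps the pair to an integer id that a word embedding layer could consume. The function names, the `position_kmer` token format, the choice k = 3, and the use of id 0 for unknown tokens are all assumptions made for this example, not details taken from spEnhancer.

```python
from itertools import product


def positional_kmer_tokens(seq, k=3):
    """Return one token per sliding window, combining start position and k-mer.

    The "position_kmer" token format is a hypothetical choice for illustration.
    """
    seq = seq.upper()
    return [f"{i}_{seq[i:i + k]}" for i in range(len(seq) - k + 1)]


def build_vocab(seq_len, k=3, alphabet="ACGT"):
    """Assign an integer id to every (position, k-mer) pair.

    Id 0 is reserved here for unknown tokens (an assumption of this sketch).
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    vocab = {}
    for i in range(seq_len - k + 1):
        for kmer in kmers:
            vocab[f"{i}_{kmer}"] = len(vocab) + 1
    return vocab


def encode(seq, vocab, k=3):
    """Map a DNA sequence to a list of integer ids, one per k-mer position."""
    return [vocab.get(tok, 0) for tok in positional_kmer_tokens(seq, k)]


if __name__ == "__main__":
    vocab = build_vocab(seq_len=5, k=3)
    print(positional_kmer_tokens("ACGTA"))
    print(encode("ACGTA", vocab))
```

Under this scheme the same k-mer at two different positions receives two different ids, so a position-wise feature selection step (such as the Chi-squared test mentioned above) could drop whole positions before the embedding layer.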
Leave-one-out validation on the training dataset shows that the proposed spEnhancer achieves detection performance similar to that of the three existing studies, while on the independent test dataset spEnhancer achieves the best overall performance metric, the Matthews correlation coefficient (MCC), for both binary classification problems. The experimental data show that the strategy of removing redundant positions (SeqPose) may help improve DNA sequence-based prediction models. spEnhancer may serve well as a complementary model to the existing studies, especially for novel query enhancers that are not included in the training dataset.
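For readers unfamiliar with MCC, the following minimal sketch computes it from the four cells of a binary confusion matrix using the standard definition. The function name and the convention of returning 0.0 when the denominator vanishes are choices made for this example; the counts in the usage lines are made up and are not results from the study.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns 0.0 when any marginal total is zero (a common convention,
    adopted here as an assumption).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Illustrative counts only.
    print(mcc(tp=50, tn=50, fp=0, fn=0))    # perfect prediction
    print(mcc(tp=25, tn=25, fp=25, fn=25))  # no better than chance
```

MCC ranges from -1 to 1 and uses all four confusion-matrix cells, which makes it a useful single summary for both the enhancer versus non-enhancer task and the strength classification task.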