Yang Runtao, Wu Feng, Zhang Chengjin, Zhang Lina
School of Mechanical, Electrical and Information Engineering, Shandong University, Weihai 264209, China.
Int J Mol Sci. 2021 Mar 30;22(7):3589. doi: 10.3390/ijms22073589.
As critical components of DNA, enhancers efficiently and specifically control the spatial and temporal regulation of gene transcription. Malfunction or dysregulation of enhancers is implicated in many human pathologies. Identifying enhancers and their strength may therefore provide insights into the molecular mechanisms of gene transcription and facilitate the discovery of candidate drug targets. In this paper, a new predictor of enhancers and their strength, iEnhancer-GAN, is proposed based on a deep learning framework that combines word embedding with a sequence generative adversarial net (Seq-GAN). Because the training dataset is relatively small, the Seq-GAN is used to generate artificial sequences that expand it. Given that each functional element in a DNA sequence is analogous to a "word" in linguistics, word segmentation methods are proposed to divide DNA sequences into "words", and the skip-gram model is employed to transform the "words" into numerical vectors. Because convolutional neural networks (CNNs) are powerful at extracting high-level abstract features, a CNN architecture is constructed to perform the identification tasks, and the word vectors of each DNA sequence are vertically concatenated to form the embedding matrix used as the CNN input. Experimental results demonstrate the effectiveness of the Seq-GAN in expanding the training dataset, the possibility of applying word segmentation methods to extract "words" from DNA sequences, the feasibility of using the skip-gram model to encode DNA sequences, and the strong predictive ability of the CNN. Compared with other state-of-the-art methods on the training dataset and the independent test dataset, the proposed method achieves a significantly improved overall performance. The proposed method is expected to help advance enhancer-related research.
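The encoding steps described in the abstract (segmenting a DNA sequence into "words", pairing words for skip-gram training, and stacking word vectors into an embedding matrix) can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not specify the segmentation scheme, so fixed-length k-mers with a configurable stride are assumed here, and the function names `segment`, `skipgram_pairs`, and `embedding_matrix`, the window size, and the toy vector values are all hypothetical.

```python
from typing import Dict, List, Tuple


def segment(seq: str, k: int = 3, stride: int = 1) -> List[str]:
    """Split a DNA sequence into k-mer "words".

    Assumption: k-mers stand in for the paper's word segmentation;
    stride=1 gives overlapping words, stride=k non-overlapping ones.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def skipgram_pairs(words: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Enumerate the (center, context) pairs a skip-gram model trains on."""
    pairs: List[Tuple[str, str]] = []
    for i, center in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        pairs.extend((center, words[j]) for j in range(lo, hi) if j != i)
    return pairs


def embedding_matrix(words: List[str],
                     vectors: Dict[str, List[float]]) -> List[List[float]]:
    """Stack one word vector per row (vertical concatenation)."""
    return [vectors[w] for w in words]


words = segment("ACGTAC", k=3)
pairs = skipgram_pairs(words, window=1)

# Placeholder 2-dimensional vectors; a trained skip-gram model would supply these.
toy_vectors = {w: [float(i), float(i) + 0.5] for i, w in enumerate(words)}
matrix = embedding_matrix(words, toy_vectors)

print(words)        # ['ACG', 'CGT', 'GTA', 'TAC']
print(len(pairs))   # 6
print(len(matrix), len(matrix[0]))  # 4 2
```

The resulting matrix has one row per "word" and one column per embedding dimension, which is the shape a CNN would take as input. In practice the vectors would come from a skip-gram model trained on the segmented sequences rather than from a hand-built table.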