Dong Guohao, Wu Yuqian, Huang Lan, Li Fei, Zhou Fengfeng
Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China.
College of Computer Science and Technology, Jilin University, Changchun 130012, China.
Genes (Basel). 2024 Dec 12;15(12):1593. doi: 10.3390/genes15121593.
BACKGROUND/OBJECTIVES: Understanding the relationship between DNA sequences and gene expression levels is of significant biological importance. Recent advancements have demonstrated the ability of deep learning to predict gene expression levels directly from genomic data. However, traditional methods are limited by basic word encoding techniques, which fail to capture the inherent features and patterns of DNA sequences.
METHODS: We introduce TExCNN, a novel framework that integrates the pre-trained models DNABERT and DNABERT-2 to generate word embeddings for DNA sequences. We partitioned the DNA sequences into manageable segments and computed an embedding for each segment using the pre-trained models. These embeddings were then used as inputs to our deep learning framework, which is based on a convolutional neural network.
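The segment-then-embed step described above can be sketched as follows. This is only an illustration under stated assumptions: `split_into_segments`, `toy_embed`, and `embed_sequence` are hypothetical names, the segment length is arbitrary, and `toy_embed` (plain base frequencies) is a stand-in for a pre-trained encoder such as DNABERT, not the encoder itself.

```python
from typing import Callable, List


def split_into_segments(sequence: str, segment_len: int) -> List[str]:
    """Split a DNA sequence into consecutive, non-overlapping segments.

    The final segment may be shorter than segment_len.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    return [sequence[i:i + segment_len]
            for i in range(0, len(sequence), segment_len)]


def toy_embed(segment: str) -> List[float]:
    """Hypothetical stand-in for a pre-trained encoder.

    Returns the frequencies of A, C, G and T in the segment. A real
    pipeline would return the embedding vector produced by the model.
    """
    n = len(segment) or 1
    return [segment.count(base) / n for base in "ACGT"]


def embed_sequence(sequence: str,
                   segment_len: int,
                   embed: Callable[[str], List[float]] = toy_embed
                   ) -> List[List[float]]:
    """Embed each segment and stack the vectors in sequence order.

    The result is a (num_segments x embedding_dim) matrix that a
    convolutional network could take as input.
    """
    segments = split_into_segments(sequence.upper(), segment_len)
    return [embed(seg) for seg in segments]


# Toy input: 12 bases split into three 4-base segments.
features = embed_sequence("ACGTACGTAAGG", segment_len=4)
```

Passing the encoder in as a callable keeps the segmentation logic independent of which pre-trained model produces the vectors.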
RESULTS: TExCNN outperformed current state-of-the-art models, achieving an average R score of 0.622, compared with 0.596 for the DeepLncLoc model, which is based on the Word2Vec model and a text convolutional neural network. Furthermore, when the sequence length was extended from 10,500 bp to 50,000 bp, TExCNN achieved an even higher average R score of 0.639. The prediction accuracy improved further when additional biological features were incorporated.
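The abstract does not define the R score. Assuming it denotes the Pearson correlation coefficient between predicted and observed expression levels, averaged over several evaluation sets, a minimal sketch of the metric could look like this. The function names `pearson_r` and `average_r` are hypothetical, and the numbers in the example are made up for illustration; they are not data from the study.

```python
import math
from typing import List, Sequence, Tuple


def pearson_r(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    if len(predicted) != len(observed) or len(predicted) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o)
              for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_p * var_o)


def average_r(pairs: List[Tuple[Sequence[float], Sequence[float]]]) -> float:
    """Mean Pearson R over several (predicted, observed) pairs,
    for example one pair per evaluation set (an assumption here)."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    return sum(pearson_r(p, o) for p, o in pairs) / len(pairs)


# Illustrative values only.
perfect = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
inverse = pearson_r([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
mean_of_both = average_r([([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]),
                          ([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])])
```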
CONCLUSIONS: Our experimental results demonstrate that using pre-trained models to generate word embeddings significantly improves the accuracy of gene expression prediction. The proposed TExCNN pipeline performs best with longer DNA sequences and is adaptable to both cell-type-independent and cell-type-dependent predictions.