
TExCNN: Leveraging Pre-Trained Models to Predict Gene Expression from Genomic Sequences.

Authors

Dong Guohao, Wu Yuqian, Huang Lan, Li Fei, Zhou Fengfeng

Affiliations

Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China.

College of Computer Science and Technology, Jilin University, Changchun 130012, China.

Publication information

Genes (Basel). 2024 Dec 12;15(12):1593. doi: 10.3390/genes15121593.

DOI: 10.3390/genes15121593
PMID: 39766860
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11675716/
Abstract

BACKGROUND/OBJECTIVES

Understanding the relationship between DNA sequences and gene expression levels is of significant biological importance. Recent advancements have demonstrated the ability of deep learning to predict gene expression levels directly from genomic data. However, traditional methods are limited by basic word encoding techniques, which fail to capture the inherent features and patterns of DNA sequences.

METHODS

We introduce TExCNN, a novel framework that integrates the pre-trained models DNABERT and DNABERT-2 to generate word embeddings for DNA sequences. We partitioned the DNA sequences into manageable segments and computed their respective embeddings using the pre-trained models. These embeddings were then used as inputs to our deep learning framework, which is based on a convolutional neural network.
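The partition-then-embed step can be illustrated with a minimal sketch. This is not the authors' implementation: the segment length, the function names, and the `toy_embed` stand-in (which returns base frequencies instead of calling DNABERT or DNABERT-2) are all assumptions chosen so that the example runs without any pre-trained weights.

```python
from typing import List


def partition_sequence(seq: str, segment_len: int) -> List[str]:
    """Split a DNA sequence into consecutive, non-overlapping segments.

    The last segment may be shorter than segment_len.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    return [seq[i:i + segment_len] for i in range(0, len(seq), segment_len)]


def toy_embed(segment: str) -> List[float]:
    """Hypothetical stand-in for a pre-trained encoder.

    Returns the frequencies of A, C, G and T in the segment. A real
    pipeline would return the encoder's embedding vector here.
    """
    n = max(len(segment), 1)
    return [segment.count(base) / n for base in "ACGT"]


def embed_sequence(seq: str, segment_len: int) -> List[List[float]]:
    """Embed every segment; the rows form the input matrix for a CNN."""
    return [toy_embed(s) for s in partition_sequence(seq.upper(), segment_len)]


segments = partition_sequence("ACGTACGTAC", 4)
matrix = embed_sequence("acgtacgtac", 4)
print(segments)
print(matrix)
```

Each row of `matrix` corresponds to one segment, so a sequence of any length becomes a fixed-width matrix whose number of rows grows with the sequence length, which is the kind of input a convolutional network can consume.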

RESULTS

TExCNN outperformed current state-of-the-art models, achieving an average R score of 0.622, compared to the 0.596 score achieved by the DeepLncLoc model, which is based on the Word2Vec model and a text convolutional neural network. Furthermore, when the sequence length was extended from 10,500 bp to 50,000 bp, TExCNN achieved an even higher average R score of 0.639. The prediction accuracy improved further when additional biological features were incorporated.
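The abstract does not define the R score. Assuming it denotes the Pearson correlation coefficient between predicted and measured expression levels, and that the average is taken over several evaluation sets (both are assumptions), the metric could be computed along these lines. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def average_r(pairs: List[Tuple[Sequence[float], Sequence[float]]]) -> float:
    """Mean Pearson r over several (predicted, measured) evaluation sets."""
    if not pairs:
        raise ValueError("at least one evaluation set is required")
    return sum(pearson_r(pred, obs) for pred, obs in pairs) / len(pairs)


r_up = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
r_down = pearson_r([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
r_mean = average_r([([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]),
                    ([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])])
```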

CONCLUSIONS

Our experimental results demonstrate that the use of pre-trained models for word embedding generation significantly improves the accuracy of predicting gene expression. The proposed TExCNN pipeline performs best with longer DNA sequences and is adaptable for both cell-type-independent and cell-type-dependent predictions.


