DEEP MOTIF DASHBOARD: VISUALIZING AND UNDERSTANDING GENOMIC SEQUENCES USING DEEP NEURAL NETWORKS.

Authors

Jack Lanchantin, Ritambhara Singh, Beilun Wang, Yanjun Qi

Affiliation

Department of Computer Science, University of Virginia, Charlottesville, VA 22903, USA

Publication

Pac Symp Biocomput. 2017;22:254-265. doi: 10.1142/9789813207813_0025.

PMID: 27896980

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5787355/

Abstract

Deep neural network (DNN) models have recently obtained state-of-the-art prediction accuracy for the transcription factor binding site (TFBS) classification task. However, it remains unclear how these approaches identify meaningful DNA sequence signals and what insights they give into why TFs bind to certain locations. In this paper, we propose a toolkit called the Deep Motif Dashboard (DeMo Dashboard), which provides a suite of visualization strategies to extract motifs, or sequence patterns, from deep neural network models for TFBS classification. We demonstrate how to visualize and understand three important DNN models: convolutional, recurrent, and convolutional-recurrent networks. Our first visualization method finds a test sequence's saliency map, which uses first-order derivatives to describe the importance of each nucleotide in making the final prediction. Second, because recurrent models make predictions in a temporal manner (from one end of a TFBS sequence to the other), we introduce temporal output scores, which indicate the prediction score of a model over time for a sequential input. Lastly, a class-specific visualization strategy finds the optimal input sequence for a given TFBS positive class via stochastic gradient optimization. Our experimental results indicate that the convolutional-recurrent architecture performs best among the three. The visualization techniques indicate that the CNN-RNN makes predictions by modeling both motifs and the dependencies among them.
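
To make the saliency-map idea concrete, here is a minimal sketch in Python with NumPy. It is not the DeMo Dashboard code. The toy model (a single hand-written filter for the made-up motif "TGAC", max pooling, and a sigmoid), the names `one_hot`, `score`, and `saliency`, and the choice to multiply the gradient by the one-hot input are all assumptions made for illustration. The gradient is estimated with central finite differences so that the sketch does not depend on any deep learning library.

```python
import numpy as np

ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot matrix."""
    x = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        x[i, ALPHABET.index(base)] = 1.0
    return x


# Hypothetical toy classifier, NOT the paper's network: one convolutional
# filter that matches "TGAC", max pooling over positions, then a sigmoid.
MOTIF_FILTER = one_hot("TGAC")


def score(x):
    """Return a binding score in (0, 1) for a one-hot encoded sequence."""
    k = MOTIF_FILTER.shape[0]
    acts = [np.sum(MOTIF_FILTER * x[i:i + k]) for i in range(x.shape[0] - k + 1)]
    return 1.0 / (1.0 + np.exp(-(max(acts) - 2.0)))


def saliency(x, eps=1e-4):
    """Per-nucleotide importance from first-order derivatives of the score.

    The gradient of the score with respect to every input entry is estimated
    by central finite differences, multiplied element-wise by the one-hot
    input, and reduced to one absolute value per sequence position.
    """
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        step = np.zeros_like(x)
        step[idx] = eps
        grad[idx] = (score(x + step) - score(x - step)) / (2 * eps)
    return np.abs(grad * x).sum(axis=1)


seq = "AAAATGACAAAA"
sal = saliency(one_hot(seq))
for base, value in zip(seq, sal):
    print(base, round(float(value), 4))
```

In this toy setting only the four positions covered by the embedded motif receive non-zero saliency, which is the kind of signal a saliency map is meant to surface. With a trained network, the finite-difference loop would normally be replaced by a single backpropagation pass.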
