Lanchantin Jack, Singh Ritambhara, Wang Beilun, Qi Yanjun
Department of Computer Science, University of Virginia, Charlottesville, VA 22903, USA.
Pac Symp Biocomput. 2017;22:254-265. doi: 10.1142/9789813207813_0025.
Deep neural network (DNN) models have recently obtained state-of-the-art prediction accuracy for the transcription factor binding site (TFBS) classification task. However, it remains unclear how these approaches identify meaningful DNA sequence signals and whether they can yield insight into why TFs bind to certain locations. In this paper, we propose a toolkit called the Deep Motif Dashboard (DeMo Dashboard), which provides a suite of visualization strategies to extract motifs, or sequence patterns, from deep neural network models for TFBS classification. We demonstrate how to visualize and understand three important DNN models: convolutional, recurrent, and convolutional-recurrent networks. Our first visualization method finds a test sequence's saliency map, which uses first-order derivatives to describe the importance of each nucleotide in making the final prediction. Second, because recurrent models make predictions in a temporal manner (from one end of a TFBS sequence to the other), we introduce temporal output scores, which indicate the prediction score of a model over time for a sequential input. Lastly, a class-specific visualization strategy finds the optimal input sequence for a given TFBS positive class via stochastic gradient optimization. Our experimental results indicate that the convolutional-recurrent architecture performs the best among the three architectures. The visualization techniques indicate that the CNN-RNN makes predictions by modeling both motifs and the dependencies among them.
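To make the saliency map idea concrete, the following is a minimal sketch and not the DeMo Dashboard implementation. It assumes a toy stand-in model (a single sigmoid unit over a one-hot encoded sequence) in place of the CNN, RNN, or CNN-RNN, so the first-order derivative can be written by hand. The names `one_hot`, `score`, and `saliency_map`, and the weight values, are hypothetical and chosen only for illustration. Each position's saliency is taken here as the absolute gradient of the prediction score with respect to the nucleotide actually observed at that position, which is one common way to reduce the gradient to a per-nucleotide importance.

```python
import math

ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element one-hot rows (A, C, G, T)."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET] for base in seq]


def score(x, weights, bias):
    """Toy stand-in for a trained model: sigmoid of a weighted sum of the input."""
    z = bias + sum(
        w * v
        for w_row, x_row in zip(weights, x)
        for w, v in zip(w_row, x_row)
    )
    return 1.0 / (1.0 + math.exp(-z))


def saliency_map(seq, weights, bias):
    """Per-position saliency: |d score / d x| at the observed nucleotide."""
    x = one_hot(seq)
    s = score(x, weights, bias)
    # Chain rule for this toy model: d sigmoid(z)/dz = s * (1 - s),
    # and dz/dx[i][c] = weights[i][c].
    grad = [[s * (1.0 - s) * w for w in w_row] for w_row in weights]
    # Multiplying by the one-hot row keeps only the observed nucleotide's gradient.
    return [
        abs(sum(g * v for g, v in zip(g_row, x_row)))
        for g_row, x_row in zip(grad, x)
    ]


# Hypothetical weights for a length-4 input; rows are positions, columns are A, C, G, T.
weights = [
    [0.1, 0.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],  # strongly favors G at position 1
    [0.0, 0.0, 0.0, 1.5],  # favors T at position 2
    [0.0, 0.2, 0.0, 0.0],
]
bias = -1.0

sal = saliency_map("AGTC", weights, bias)
for base, value in zip("AGTC", sal):
    print(base, round(value, 4))
```

In this sketch the positions with the largest weights on the observed nucleotide receive the highest saliency. With a real deep model the gradient would come from backpropagation rather than from a closed-form expression.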