Deep Convolutional Neural Networks for large-scale speech tasks.

Affiliations

IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, United States.

Department of Computer Science, University of Toronto, Toronto, Canada.

Publication information

Neural Netw. 2015 Apr;64:39-48. doi: 10.1016/j.neunet.2014.08.005. Epub 2014 Sep 16.

Abstract

Convolutional Neural Networks (CNNs) are an alternative type of neural network that can be used to reduce spectral variations and model spectral correlations which exist in signals. Since speech signals exhibit both of these properties, we hypothesize that CNNs are a more effective model for speech compared to Deep Neural Networks (DNNs). In this paper, we explore applying CNNs to large vocabulary continuous speech recognition (LVCSR) tasks. First, we determine the appropriate architecture to make CNNs effective compared to DNNs for LVCSR tasks. Specifically, we focus on how many convolutional layers are needed, what an appropriate number of hidden units is, and what the best pooling strategy is. Second, we investigate how to incorporate speaker-adapted features, which cannot directly be modeled by CNNs as they do not obey locality in frequency, into the CNN framework. Third, given the importance of sequence training for speech tasks, we introduce a strategy to use ReLU+dropout during Hessian-free sequence training of CNNs. Experiments on 3 LVCSR tasks indicate that a CNN with the proposed speaker-adapted and ReLU+dropout ideas allows for a 12%-14% relative improvement in WER over a strong DNN system, achieving state-of-the-art results on these 3 tasks.
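
As a rough illustration of the architectural ingredients the abstract names (convolutional layers, a pooling strategy, and ReLU+dropout), below is a minimal PyTorch sketch of a CNN acoustic model. It is an assumption-laden toy, not the authors' configuration: the feature dimensions (N_MELS, CONTEXT), layer widths, and output target count (N_STATES) are illustrative values, and pooling only along the frequency axis is one common choice for speech CNNs, not necessarily the paper's answer to the pooling question.

```python
# Minimal sketch of a CNN acoustic model in the spirit of the abstract:
# convolutional layers over (frequency, time) log-mel patches, pooling in
# frequency, and fully connected layers with ReLU + dropout. All sizes and
# names below are illustrative assumptions, not values from the paper.

import torch
import torch.nn as nn

N_MELS = 40      # assumed log-mel filterbank size
CONTEXT = 11     # assumed context window, in frames
N_STATES = 1000  # assumed number of context-dependent HMM state targets

class SpeechCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            # Convolve over (frequency, time); pool only along frequency,
            # where local spectral correlations live.
            nn.Conv2d(1, 128, kernel_size=(9, 9), padding=(4, 4)),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(3, 1)),  # frequency-only pooling
            nn.Conv2d(128, 256, kernel_size=(4, 3), padding=(2, 1)),
            nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(1024),  # infers the flattened size at first call
            nn.ReLU(),
            nn.Dropout(p=0.5),    # ReLU + dropout, as in the abstract
            nn.Linear(1024, N_STATES),
        )

    def forward(self, x):
        # x: (batch, 1, N_MELS, CONTEXT) log-mel feature patch
        return self.classifier(self.features(x))

model = SpeechCNN()
frames = torch.randn(8, 1, N_MELS, CONTEXT)  # dummy mini-batch
logits = model(frames)
print(logits.shape)  # torch.Size([8, 1000])
```

On the headline numbers, note that the WER gains are quoted in relative terms: a 13% relative improvement would take a hypothetical baseline WER of 15.0% down to 15.0 x (1 - 0.13) = 13.05%, not to 2.0%.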
