Multi-resolution speech analysis for automatic speech recognition using deep neural networks: Experiments on TIMIT.

Affiliation

AuDIaS - Audio, Data Intelligence and Speech, Universidad Autónoma de Madrid, Madrid, Spain.

Publication information

PLoS One. 2018 Oct 10;13(10):e0205355. doi: 10.1371/journal.pone.0205355. eCollection 2018.

Abstract

Speech analysis for Automatic Speech Recognition (ASR) systems typically starts with a Short-Time Fourier Transform (STFT), which implies selecting a fixed point in the time-frequency resolution trade-off. This approach, combined with a Mel-frequency scaled filterbank and a Discrete Cosine Transform, gives rise to the Mel-Frequency Cepstral Coefficients (MFCC), which have been the most common speech features in speech processing for the last few decades. These features were particularly well suited to the previous Hidden Markov Model/Gaussian Mixture Model (HMM/GMM) state of the art in ASR. In particular, they produce highly uncorrelated features of small dimensionality (typically 13 coefficients plus deltas and double deltas), which was very convenient for diagonal-covariance GMMs, for dealing with the curse of dimensionality, and for the limited computing resources of a decade ago. Currently, most ASR systems use Deep Neural Networks (DNN) instead of GMMs to model the acoustic features, which provides more flexibility in how the features are defined. In particular, acoustic features can be highly correlated and much larger in size, because DNNs are very powerful at processing high-dimensional inputs. Computing hardware has also evolved to the point where computational cost is a much less relevant issue in speech processing. In this context we decided to revisit the problem of time-frequency resolution in speech analysis, and in particular to check whether multi-resolution speech analysis (in both time and frequency) can help improve acoustic modeling with DNNs. Our experiments start from several Kaldi baseline systems for the well-known TIMIT corpus and modify them by adding multi-resolution speech representations: concatenations of spectra computed at different time-frequency resolutions, and of post-processed, speaker-adapted features computed at different time-frequency resolutions. Our experiments show that a multi-resolution speech representation tends to improve over the baseline single-resolution representation, which seems to confirm our main hypothesis. However, combining multi-resolution with the highly post-processed, speaker-adapted features that give the best Kaldi results on TIMIT yields only very modest improvements.
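The central manipulation described in the abstract, concatenating features computed at several STFT window lengths into one frame-synchronous vector, is easy to sketch. The following is a minimal illustration of that idea, not the authors' Kaldi recipe: it uses librosa rather than Kaldi, and the window lengths (10, 25 and 50 ms), the 10 ms hop, and the helper name multi_resolution_mfcc are hypothetical choices for demonstration.

```python
# Minimal sketch of multi-resolution MFCC extraction: compute MFCC streams
# at several points on the time-frequency resolution trade-off (several
# STFT window lengths) over a shared hop, then concatenate per frame.
# Assumptions: librosa is used instead of Kaldi; window lengths, hop size
# and the function name are illustrative, not taken from the paper.
import numpy as np
import librosa

def multi_resolution_mfcc(wav_path, sr=16000, hop_ms=10,
                          win_lengths_ms=(10, 25, 50), n_mfcc=13):
    """Concatenate MFCC streams computed with several analysis windows.

    All streams share the same hop, so their frames stay time-aligned and
    can be stacked into a single multi-resolution feature vector per frame.
    """
    y, sr = librosa.load(wav_path, sr=sr)        # resample to a common rate
    hop = int(sr * hop_ms / 1000)                # frame shift in samples
    streams = []
    for win_ms in win_lengths_ms:
        win = int(sr * win_ms / 1000)            # analysis window in samples
        n_fft = int(2 ** np.ceil(np.log2(win)))  # zero-pad to a power of two
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                    n_fft=n_fft, win_length=win,
                                    hop_length=hop)
        streams.append(mfcc)                     # shape: (n_mfcc, n_frames)
    # Guard against off-by-one frame counts at the edges, then stack along
    # the feature axis: (len(win_lengths_ms) * n_mfcc, n_frames).
    n_frames = min(s.shape[1] for s in streams)
    return np.concatenate([s[:, :n_frames] for s in streams], axis=0)
```

A short window trades frequency resolution for time resolution and a long window does the opposite, so the concatenated vector exposes several points on that trade-off to the acoustic model at once; deltas and double deltas could be appended per stream (e.g. with librosa.feature.delta) to mirror the conventional 13-coefficient layout mentioned above.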

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b3b0/6179252/967b5255f5e8/pone.0205355.g001.jpg
