Schädler Marc René, Kollmeier Birger
Medizinische Physik and Cluster of Excellence Hearing4all, Universität Oldenburg, D-26111 Oldenburg, Germany.
J Acoust Soc Am. 2015 Apr;137(4):2047-59. doi: 10.1121/1.4916618.
To test whether simultaneous spectral and temporal processing is required to extract robust features for automatic speech recognition (ASR), the robust spectro-temporal two-dimensional Gabor filter bank (GBFB) front-end from Schädler, Meyer, and Kollmeier [J. Acoust. Soc. Am. 131, 4134-4151 (2012)] was decomposed into a spectral one-dimensional Gabor filter bank and a temporal one-dimensional Gabor filter bank. A feature set extracted with these separate spectral and temporal modulation filter banks, the separate Gabor filter bank (SGBFB) features, was introduced and evaluated on the CHiME (Computational Hearing in Multisource Environments) keywords-in-noise recognition task. From the perspective of robust ASR, the results showed that spectral and temporal processing can be performed independently and need not interact with each other. Using SGBFB features permitted the signal-to-noise ratio (SNR) to be lowered by 1.2 dB while still performing as well as the GBFB-based reference system, which corresponds to a relative improvement in word error rate of 12.8%. Additionally, the real-time factor of the spectro-temporal processing could be reduced by more than an order of magnitude. Compared to human listeners, the SNR needed to be 13 dB higher when using Mel-frequency cepstral coefficient features, 11 dB higher when using GBFB features, and 9 dB higher when using SGBFB features to achieve the same recognition performance.
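The decomposition into one-dimensional spectral and temporal filters can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the Hann-windowed complex sinusoid used as a stand-in Gabor filter, the modulation frequencies, the filter lengths, and the toy spectrogram are all assumptions chosen for illustration. The sketch shows only the general separability property: filtering along the frequency axis and then along the time axis gives the same result as one pass with the two-dimensional filter formed by the outer product of the two one-dimensional filters, at lower cost per output point (for 7 and 11 taps, 7 + 11 = 18 multiplications instead of 7 × 11 = 77).

```python
import cmath
import math


def gabor_1d(omega, half_width):
    """Hypothetical 1D Gabor filter: Hann envelope times a complex sinusoid.

    omega is the modulation frequency in radians per sample (an assumed value).
    """
    taps = []
    for i in range(2 * half_width + 1):
        x = i - half_width
        hann = 0.5 + 0.5 * math.cos(math.pi * x / (half_width + 1))
        taps.append(hann * cmath.exp(1j * omega * x))
    return taps


def filter_1d(signal, taps):
    """Same-size convolution with an odd-length filter and zero padding."""
    half = len(taps) // 2
    out = []
    for i in range(len(signal)):
        acc = 0j
        for k, tap in enumerate(taps):
            j = i - (k - half)
            if 0 <= j < len(signal):
                acc += tap * signal[j]
        out.append(acc)
    return out


def filter_spectral(spec, taps):
    """Filter across channels; spec is indexed as spec[channel][frame]."""
    columns = [filter_1d(list(col), taps) for col in zip(*spec)]
    return [list(row) for row in zip(*columns)]


def filter_temporal(spec, taps):
    """Filter across frames, one channel at a time."""
    return [filter_1d(row, taps) for row in spec]


def filter_2d(spec, spectral_taps, temporal_taps):
    """Direct 2D convolution with the outer product of the two 1D filters."""
    hs, ht = len(spectral_taps) // 2, len(temporal_taps) // 2
    n_ch, n_fr = len(spec), len(spec[0])
    out = [[0j] * n_fr for _ in range(n_ch)]
    for c in range(n_ch):
        for t in range(n_fr):
            acc = 0j
            for a, sa in enumerate(spectral_taps):
                cc = c - (a - hs)
                if not 0 <= cc < n_ch:
                    continue
                for b, tb in enumerate(temporal_taps):
                    tt = t - (b - ht)
                    if 0 <= tt < n_fr:
                        acc += sa * tb * spec[cc][tt]
            out[c][t] = acc
    return out


# Toy stand-in for a log Mel-spectrogram: 8 channels by 20 frames (made-up data).
spec = [[math.sin(0.3 * c) + math.cos(0.2 * t) for t in range(20)] for c in range(8)]
spectral_taps = gabor_1d(omega=0.5, half_width=3)  # 7 taps, assumed parameters
temporal_taps = gabor_1d(omega=0.3, half_width=5)  # 11 taps, assumed parameters

cascaded = filter_temporal(filter_spectral(spec, spectral_taps), temporal_taps)
direct = filter_2d(spec, spectral_taps, temporal_taps)
max_diff = max(abs(x - y) for rc, rd in zip(cascaded, direct) for x, y in zip(rc, rd))

print("largest difference between cascaded 1D and direct 2D filtering:", max_diff)
print("multiplications per output point, 2D:", len(spectral_taps) * len(temporal_taps))
print("multiplications per output point, two 1D passes:", len(spectral_taps) + len(temporal_taps))
```

The equivalence holds for the complex filter only. If only the real part of the two-dimensional filter is kept, it is a sum of two separable terms (cos a cos b minus sin a sin b) rather than a single outer product, so features from separate one-dimensional filter banks need not be identical to two-dimensional ones.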