Department of Computer Science and Engineering and Center for Cognitive Science, The Ohio State University, Columbus, Ohio 43210, USA.
J Acoust Soc Am. 2013 May;133(5):3083-93. doi: 10.1121/1.4798661.
Processing noisy signals using the ideal binary mask improves automatic speech recognition (ASR) performance. This paper presents the first study that investigates the role of binary mask patterns in ASR under various noises, signal-to-noise ratios (SNRs), and vocabulary sizes. Binary masks are computed either by comparing the SNR within a time-frequency unit of a mixture signal with a local criterion (LC), or by comparing the local target energy with the long-term average spectral energy of speech. ASR results show that (1) akin to human speech recognition, binary masking significantly improves ASR performance even when the SNR is as low as -60 dB; (2) the ASR performance profiles are qualitatively similar to those obtained in human intelligibility experiments; (3) the difference between the LC and the mixture SNR correlates more strongly with recognition accuracy than the LC itself; (4) the LC at which performance peaks is lower than 0 dB, the threshold that maximizes the SNR gain of processed signals. This broad agreement with human performance is rather surprising. The results also indicate that maximizing the SNR gain is probably not an appropriate goal for improving either human or machine recognition of noisy speech.
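The two mask definitions in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the nested-list layout (frames of per-channel linear energies), the strict `>` comparison, the small `eps` guard, and the use of an LC offset in the second mask are all assumptions made for illustration.

```python
import math

def energy_ratio_db(numerator, denominator, eps=1e-12):
    """Ratio of two linear energies in dB; eps (an assumption) avoids log(0)."""
    return 10.0 * math.log10((numerator + eps) / (denominator + eps))

def ideal_binary_mask(target, noise, lc_db=0.0):
    """Keep a time-frequency unit (1) when its local SNR exceeds the LC.

    target, noise: lists of frames, each a list of per-channel energies.
    """
    return [
        [1 if energy_ratio_db(t, n) > lc_db else 0
         for t, n in zip(t_frame, n_frame)]
        for t_frame, n_frame in zip(target, noise)
    ]

def target_binary_mask(target, long_term_speech_energy, lc_db=0.0):
    """Keep a unit when the target energy exceeds the long-term average
    speech energy of its channel by more than lc_db (offset is assumed)."""
    return [
        [1 if energy_ratio_db(t, ref) > lc_db else 0
         for t, ref in zip(frame, long_term_speech_energy)]
        for frame in target
    ]

def lc_minus_mixture_snr_db(lc_db, mixture_snr_db):
    """The difference between the LC and the mixture SNR, as in finding (3)."""
    return lc_db - mixture_snr_db

# Toy energies: 2 frames x 2 frequency channels (made-up values).
target = [[4.0, 1.0], [0.5, 9.0]]
noise = [[1.0, 1.0], [2.0, 1.0]]

mask_lc0 = ideal_binary_mask(target, noise, lc_db=0.0)
mask_lc_neg6 = ideal_binary_mask(target, noise, lc_db=-6.0)
tbm = target_binary_mask(target, [0.25, 10.0])
```

With these toy values, lowering the LC from 0 dB to -6 dB retains the unit whose local SNR is exactly 0 dB, while the unit at roughly -6.02 dB is still discarded. The second mask needs no noise signal at all, since it depends only on the target and the per-channel speech reference.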