Yufeng Yang, Ashutosh Pandey, DeLiang Wang
Department of Computer Science and Engineering, The Ohio State University, USA.
Center for Cognitive and Brain Sciences, The Ohio State University, USA.
Interspeech. 2023 Aug;2023:4913-4917. doi: 10.21437/interspeech.2023-167.
Speech enhancement algorithms have been shown to improve the intelligibility of noisy speech. However, speech enhancement has not been established as an effective frontend for robust automatic speech recognition (ASR) in noisy conditions, compared to an ASR model trained directly on noisy speech. This divide between speech enhancement and ASR impedes the progress of robust ASR systems, especially as speech enhancement has made great strides in recent years. In this work, we focus on eliminating this divide with a time-domain enhancement model based on an attentive recurrent network (ARN). The proposed system fully decouples speech enhancement from an acoustic model trained only on clean speech. Results on the CHiME-2 corpus show that ARN-enhanced speech translates to improved ASR results. The proposed system achieves an average word error rate of 6.28%, a 19.3% relative improvement over the previous best.
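The decoupling described above can be illustrated with a minimal interface sketch. All names below (`arn_enhance`, `clean_trained_asr`, `recognize`) are illustrative placeholders, not the authors' code: the point is only that the ASR component never consumes noisy speech, at training or test time, and interacts with the enhancer solely through the enhanced waveform.

```python
def arn_enhance(noisy_waveform):
    """Placeholder for the ARN time-domain enhancer: maps a noisy
    waveform to an estimate of the clean waveform."""
    # A real ARN applies attentive recurrent layers in the time
    # domain; here we return the input unchanged to show the interface.
    return list(noisy_waveform)

def clean_trained_asr(waveform):
    """Placeholder acoustic model trained only on clean speech."""
    # A real model would decode the waveform into a word sequence.
    return ["hello", "world"]

def recognize(noisy_waveform):
    # Full decoupling: enhancement and recognition are trained
    # separately, and the ASR sees only the enhancer's output.
    enhanced = arn_enhance(noisy_waveform)
    return clean_trained_asr(enhanced)

print(recognize([0.1, -0.2, 0.05]))  # → ['hello', 'world']
```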