Pandey Ashutosh, Wang DeLiang
Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210 USA.
Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210 USA.
IEEE/ACM Trans Audio Speech Lang Process. 2020;28:2489-2499. doi: 10.1109/taslp.2020.3016487. Epub 2020 Aug 14.
In recent years, supervised approaches using deep neural networks (DNNs) have become the mainstream approach to speech enhancement. It has been established that DNNs generalize well to untrained noises and speakers if they are trained on a large number of noises and speakers. However, we find that DNNs fail to generalize to new speech corpora in low signal-to-noise ratio (SNR) conditions. In this work, we establish that this lack of generalization is mainly due to channel mismatch, i.e., different recording conditions between the training corpus and untrained corpora. Additionally, we observe that traditional channel normalization techniques are not effective in improving cross-corpus generalization. Further, we evaluate publicly available datasets that are promising for generalization, and find one particular corpus to be significantly better than the others. Finally, we find that using a smaller frame shift in the short-time processing of speech can significantly improve cross-corpus generalization. The proposed techniques to address cross-corpus generalization include channel normalization, a better training corpus, and a smaller frame shift in the short-time Fourier transform (STFT). Together, these techniques significantly improve objective intelligibility and quality scores on untrained corpora.
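To illustrate what a smaller frame shift means in short-time processing, the following is a minimal sketch in Python with NumPy. The sampling rate, frame length, and the two frame shifts are illustrative assumptions and are not taken from the paper; `stft_frames` is a hypothetical helper written for this example, not the authors' implementation.

```python
import numpy as np


def stft_frames(signal, frame_len, frame_shift):
    """Split a 1-D signal into overlapping Hann-windowed frames and take their FFT.

    Returns a complex array of shape (n_frames, frame_len // 2 + 1).
    Trailing samples that do not fill a whole frame are dropped.
    """
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    window = np.hanning(frame_len)
    frames = np.stack([
        signal[i * frame_shift: i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])
    return np.fft.rfft(frames, axis=1)


# Assumed settings for illustration only (not values from the paper).
sample_rate = 16000                      # 16 kHz
frame_len = 512                          # 32 ms frames
rng = np.random.default_rng(0)
signal = rng.standard_normal(sample_rate)  # 1 s of noise as a stand-in for speech

coarse = stft_frames(signal, frame_len, frame_shift=256)  # 50% overlap
fine = stft_frames(signal, frame_len, frame_shift=64)     # 87.5% overlap

print(coarse.shape, fine.shape)
```

With the same frame length, reducing the frame shift leaves the frequency resolution unchanged but produces more, and more heavily overlapping, frames for the same signal, so each sample is covered by more frames.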