Wang Zhong-Qiu, Wang DeLiang
Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210-1277 USA.
Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210-1277 USA.
IEEE/ACM Trans Audio Speech Lang Process. 2020;28:941-950. doi: 10.1109/taslp.2020.2975902. Epub 2020 Feb 28.
This study investigates deep learning based single- and multi-channel speech dereverberation. For single-channel processing, we extend magnitude-domain masking and mapping based dereverberation to complex-domain mapping, where deep neural networks (DNNs) are trained to predict the real and imaginary (RI) components of the direct-path signal from reverberant (and noisy) ones. For multi-channel processing, we first compute a minimum variance distortionless response (MVDR) beamformer to cancel the direct-path signal, and then feed the RI components of the cancelled signal, which is expected to be a filtered version of non-target signals, as additional features to perform dereverberation. Trained on a large dataset of simulated room impulse responses, our models show excellent speech dereverberation and recognition performance on the test set of the REVERB challenge, consistently better than single- and multi-channel weighted prediction error (WPE) algorithms.
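The multi-channel idea in the abstract, cancelling the direct-path signal and using the real and imaginary (RI) components of what remains as extra features, can be illustrated with a small NumPy sketch. This is only an illustration under stated assumptions, not the authors' implementation: it assumes the steering vector and noise covariance for one frequency bin are already known, and it cancels the target by subtracting the steered MVDR output from a reference microphone. The function names (`mvdr_weights`, `cancel_target`, `ri_features`) are hypothetical, and the DNN that would consume these features is omitted.

```python
import numpy as np


def mvdr_weights(noise_cov, steer):
    """Standard MVDR weights w = Phi^{-1} d / (d^H Phi^{-1} d) for one frequency bin."""
    num = np.linalg.solve(noise_cov, steer)   # Phi^{-1} d
    return num / (steer.conj() @ num)         # normalise so that w^H d = 1


def cancel_target(obs, noise_cov, steer, ref=0):
    """Remove the component along `steer` from the reference microphone.

    obs: complex STFT at one frequency, shape (mics, frames).
    Assumption for this sketch: cancellation is done by subtracting the
    steered MVDR output from microphone `ref`.
    """
    w = mvdr_weights(noise_cov, steer)
    target_est = w.conj() @ obs               # w^H y, shape (frames,)
    return obs[ref] - steer[ref] * target_est


def ri_features(spec):
    """Stack real and imaginary parts along a new leading axis."""
    return np.stack([spec.real, spec.imag], axis=0)


# Toy data: one frequency bin, 4 microphones, 50 frames (all values synthetic).
rng = np.random.default_rng(0)
mics, frames = 4, 50
steer = np.exp(-1j * rng.uniform(0, 2 * np.pi, mics))            # assumed known
source = rng.standard_normal(frames) + 1j * rng.standard_normal(frames)
noise = rng.standard_normal((mics, frames)) + 1j * rng.standard_normal((mics, frames))
obs = steer[:, None] * source[None, :] + noise

# Sample noise covariance with light diagonal loading for numerical stability.
noise_cov = noise @ noise.conj().T / frames + 1e-6 * np.eye(mics)

cancelled = cancel_target(obs, noise_cov, steer)
extra_features = ri_features(cancelled)       # shape (2, frames)
```

Because the MVDR weights satisfy the distortionless constraint on the steering vector, the source term drops out of `cancelled` in this toy setting, leaving a filtered version of the non-target signal, which is the property the abstract relies on.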