Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China.
State Key Laboratory of Acoustics, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China.
J Acoust Soc Am. 2023 Jun 1;153(6):3378. doi: 10.1121/10.0019802.
This paper proposes TriU-Net, a hybrid neural beamformer for multi-channel speech enhancement that comprises three stages: beamforming, post-filtering, and distortion compensation. TriU-Net first estimates a set of masks for use within a minimum variance distortionless response (MVDR) beamformer. A deep neural network (DNN)-based post-filter then suppresses the residual noise. Finally, a DNN-based distortion compensator further improves speech quality. To characterize long-range temporal dependencies more efficiently, a network topology, the gated convolutional attention network, is proposed and used in TriU-Net. The advantage of the proposed model is that speech distortion compensation is explicitly considered, yielding higher speech quality and intelligibility. The proposed model achieved an average wb-PESQ score of 2.854 and an ESTOI of 92.57% on the CHiME-3 dataset. In addition, extensive experiments on synthetic data and real recordings confirm the effectiveness of the proposed method in noisy reverberant environments.
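As a rough illustration of the first stage, the sketch below shows how time-frequency masks can drive an MVDR beamformer. The abstract does not state which MVDR formulation TriU-Net uses, so this assumes the common reference-channel form, w(f) = Φn⁻¹Φs u / tr(Φn⁻¹Φs). The function names, array layout, and diagonal-loading constant are hypothetical choices made for this example, and in practice the masks would come from the mask-estimation network rather than being supplied by hand.

```python
import numpy as np


def masked_covariance(Y, mask, eps=1e-8):
    """Mask-weighted spatial covariance per frequency bin.

    Y:    complex STFT, shape (C, T, F) = (channels, frames, bins)
    mask: real-valued mask in [0, 1], shape (T, F)
    Returns an array of shape (F, C, C).
    """
    weighted = Y * mask[None, :, :]
    # Sum over frames of mask * y y^H for every frequency bin.
    cov = np.einsum("ctf,dtf->fcd", weighted, Y.conj())
    return cov / (mask.sum(axis=0)[:, None, None] + eps)


def mvdr_weights(phi_s, phi_n, ref=0, diag_load=1e-6):
    """Reference-channel MVDR weights, shape (F, C).

    phi_s, phi_n: speech and noise covariances, each of shape (F, C, C).
    ref:          index of the reference microphone (an assumption here).
    """
    C = phi_n.shape[-1]
    # Diagonal loading keeps the noise covariance well conditioned.
    load = diag_load * np.trace(phi_n, axis1=-2, axis2=-1).real / C
    phi_n = phi_n + load[:, None, None] * np.eye(C)
    numer = np.linalg.solve(phi_n, phi_s)  # Phi_n^{-1} Phi_s, shape (F, C, C)
    trace = np.trace(numer, axis1=-2, axis2=-1)[:, None]
    return numer[..., ref] / (trace + 1e-12)


def apply_beamformer(w, Y):
    """Apply w^H y per bin. w: (F, C), Y: (C, T, F). Returns (T, F)."""
    return np.einsum("fc,ctf->tf", w.conj(), Y)
```

Given a speech mask `m_s` and a noise mask `m_n` for a multi-channel STFT `Y`, the beamformed spectrum would be `apply_beamformer(mvdr_weights(masked_covariance(Y, m_s), masked_covariance(Y, m_n)), Y)`. The post-filtering and distortion-compensation stages are DNN modules and are not sketched here.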