Abdulbaqi Jalal, Gu Yue, Chen Shuhong, Marsic Ivan
Rutgers, the State University of New Jersey, USA.
Proc IEEE Int Conf Acoust Speech Signal Process. 2020 May;2020:6659-6663. doi: 10.1109/icassp40776.2020.9053544. Epub 2020 May 14.
Most current speech enhancement models use spectrogram features that require an expensive transformation and result in phase information loss. Previous work has overcome these issues by using convolutional networks to learn the temporal correlations across high-resolution waveforms. These models, however, are limited by memory-intensive dilated convolutions and aliasing artifacts from upsampling. We introduce an end-to-end fully recurrent neural network for single-channel speech enhancement. The network has an hourglass shape that efficiently captures long-range temporal dependencies by reducing the feature resolution without information loss. We also use residual connections to prevent gradient decay across layers and to improve model generalization. Experimental results show that our model outperforms state-of-the-art approaches on six quantitative evaluation metrics.
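To make the described architecture concrete, below is a minimal PyTorch sketch of an hourglass-shaped recurrent network operating directly on raw waveforms. It is not the authors' implementation: the GRU cell choice, layer widths, frame size, and the reshape-based lossless down/upsampling are illustrative assumptions that merely instantiate the ideas named in the abstract (resolution reduction without information loss, residual skip connections).

```python
# Hypothetical sketch of an hourglass-shaped fully recurrent network for
# raw-waveform speech enhancement. All sizes and the GRU choice are
# assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn

def downsample(x, r=2):
    # Halve temporal resolution losslessly by folding r adjacent
    # time steps into the channel dimension (no samples discarded).
    b, t, c = x.shape
    return x.reshape(b, t // r, c * r)

def upsample(x, r=2):
    # Inverse of downsample: unfold channels back into time steps.
    b, t, c = x.shape
    return x.reshape(b, t * r, c // r)

class HourglassRNN(nn.Module):
    def __init__(self, frame=64, width=128):
        super().__init__()
        self.frame = frame
        self.proj_in = nn.Linear(frame, width)
        self.enc1 = nn.GRU(width, width, batch_first=True)
        self.enc2 = nn.GRU(2 * width, 2 * width, batch_first=True)  # after 2x downsample
        self.mid = nn.GRU(4 * width, 4 * width, batch_first=True)   # after 4x downsample
        self.dec2 = nn.GRU(2 * width, 2 * width, batch_first=True)
        self.dec1 = nn.GRU(width, width, batch_first=True)
        self.proj_out = nn.Linear(width, frame)

    def forward(self, wav):
        b, n = wav.shape
        x = wav.reshape(b, n // self.frame, self.frame)  # frame the raw waveform
        x = self.proj_in(x)
        e1, _ = self.enc1(x)
        e2, _ = self.enc2(downsample(e1))
        m, _ = self.mid(downsample(e2))
        d2, _ = self.dec2(upsample(m))
        d2 = d2 + e2                                     # residual skip connection
        d1, _ = self.dec1(upsample(d2))
        d1 = d1 + e1                                     # residual skip connection
        return self.proj_out(d1).reshape(b, n)           # enhanced waveform

noisy = torch.randn(4, 16384)        # batch of raw noisy waveforms
enhanced = HourglassRNN()(noisy)
print(enhanced.shape)                # torch.Size([4, 16384])
```

The reshape-based resampling illustrates how temporal resolution can be reduced without information loss, in contrast to strided or dilated convolutions; the encoder-to-decoder additions show where residual connections counteract gradient decay.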