Institute of Automatic Control and Robotics, Poznan University of Technology, 60-965 Poznan, Poland.
Sensors (Basel). 2022 Mar 22;22(7):2440. doi: 10.3390/s22072440.
Monaural speech enhancement aims to remove background noise from an audio recording containing speech in order to improve its clarity and intelligibility. Currently, the most successful solutions for speech enhancement use deep neural networks. In a typical setting, such neural networks process the noisy input signal once and produce a single enhanced signal. However, it was recently shown that a U-Net-based network can be trained in a way that allows it to process the same input signal multiple times in order to enhance the speech even further. Unfortunately, this was tested only for two-iteration enhancement. In the current research, we extend previous efforts and demonstrate how multi-forward-pass speech enhancement can be successfully applied to other architectures, namely the ResBLSTM and Transformer-Net. Moreover, we test the three architectures with up to five iterations, thus identifying the method's limit in terms of performance gain. In our experiments, we used audio samples from the WSJ0, Noisex-92, and DCASE datasets and measured speech enhancement quality using SI-SDR, STOI, and PESQ. The results show that performing speech enhancement up to five times still improves speech intelligibility, but the gain becomes smaller with each iteration. Nevertheless, performing five iterations instead of two yields an additional 0.6 dB SI-SDR gain and a four-percentage-point STOI gain. These increments are not equal across architectures: the U-Net and Transformer-Net benefit more from multi-forward-pass enhancement than the ResBLSTM does.
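As a rough illustration of the evaluation loop described above, the following Python sketch applies an enhancer repeatedly and reports SI-SDR after each pass. It is not the authors' implementation. The names `si_sdr`, `multi_pass_enhance`, and `moving_average_enhancer` are hypothetical, the moving-average filter is only a toy stand-in for a trained U-Net, ResBLSTM, or Transformer-Net, and the sketch assumes that each pass receives the output of the previous pass, which the abstract does not spell out.

```python
import numpy as np


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (standard definition)."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to obtain the optimally scaled target.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target
    residual = estimate - projection
    ratio = (np.dot(projection, projection) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


def multi_pass_enhance(enhance_fn, noisy, n_passes):
    """Apply enhance_fn n_passes times, feeding each output back in.

    Assumption: pass k takes the output of pass k-1 as its input.
    Returns a list with the signal obtained after every pass.
    """
    outputs = []
    signal = noisy
    for _ in range(n_passes):
        signal = enhance_fn(signal)
        outputs.append(signal)
    return outputs


def moving_average_enhancer(signal, width=5):
    """Toy stand-in for a neural enhancer: a simple moving-average filter."""
    kernel = np.ones(width) / width
    return np.convolve(signal, kernel, mode="same")


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(16000) / 16000.0
    clean = np.sin(2 * np.pi * 200 * t)                  # synthetic "speech"
    noisy = clean + 0.5 * rng.standard_normal(t.size)    # additive white noise

    print(f"input : {si_sdr(noisy, clean):.2f} dB")
    for k, out in enumerate(multi_pass_enhance(moving_average_enhancer, noisy, 5), 1):
        print(f"pass {k}: {si_sdr(out, clean):.2f} dB")
```

Comparing the per-pass scores against the score of the unprocessed input is what reveals how much each additional iteration contributes; STOI and PESQ would be tracked the same way using a dedicated implementation of each metric.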