Ashutosh Pandey, DeLiang Wang
Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210 USA.
Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210 USA.
IEEE/ACM Trans Audio Speech Lang Process. 2021;29:1270-1279. doi: 10.1109/taslp.2021.3064421. Epub 2021 Mar 8.
Speech enhancement in the time domain has become increasingly popular in recent years, due to its ability to jointly enhance both the magnitude and the phase of speech. In this work, we propose a dense convolutional network (DCN) with self-attention for speech enhancement in the time domain. DCN is an encoder-decoder architecture with skip connections. Each layer in the encoder and the decoder comprises a dense block and an attention module. Dense blocks and attention modules aid feature extraction through a combination of feature reuse, increased network depth, and maximum context aggregation. Furthermore, we reveal previously unknown problems with a loss based on the spectral magnitude of enhanced speech. To alleviate these problems, we propose a novel loss based on the magnitudes of the enhanced speech and the predicted noise. Even though the proposed loss is based on magnitudes only, a constraint imposed by noise prediction ensures that the loss enhances both magnitude and phase. Experimental results demonstrate that DCN trained with the proposed loss substantially outperforms other state-of-the-art approaches to causal and non-causal speech enhancement.
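The core idea of the proposed loss can be illustrated with a minimal NumPy sketch: the predicted noise is tied to the enhanced speech through the mixture, so a magnitude-only loss on both terms still constrains phase. The function names, the Hann-windowed framing, and the simple mean-absolute-error weighting below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def stft_mag(x, frame_len=512, hop=256):
    """Magnitude STFT via Hann-windowed frames and the real FFT (illustrative)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1))

def magnitude_noise_loss(clean, noisy, enhanced):
    """Sketch of a loss over the magnitudes of enhanced speech and predicted
    noise. The predicted noise is defined as (noisy - enhanced), so matching
    both magnitudes implicitly constrains the phase of the enhanced signal.
    The paper's exact weighting and normalization may differ."""
    noise_true = noisy - clean
    noise_pred = noisy - enhanced  # constrained by the mixture signal
    loss_speech = np.mean(np.abs(stft_mag(enhanced) - stft_mag(clean)))
    loss_noise = np.mean(np.abs(stft_mag(noise_pred) - stft_mag(noise_true)))
    return loss_speech + loss_noise
```

If the enhanced signal equals the clean signal, both terms vanish; any deviation in magnitude or in the implied noise estimate increases the loss.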