Tan Ke, Wang DeLiang
Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, 43210-1277 USA.
Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210-1277, USA.
IEEE/ACM Trans Audio Speech Lang Process. 2020;28:380-390. doi: 10.1109/taslp.2019.2955276. Epub 2019 Nov 22.
Phase is important for the perceptual quality of speech. However, directly estimating phase spectra through supervised learning seems intractable because they lack spectrotemporal structure. Complex spectral mapping aims to estimate the real and imaginary spectrograms of clean speech from those of noisy speech, which simultaneously enhances the magnitude and phase responses of speech. Inspired by multi-task learning, we propose a gated convolutional recurrent network (GCRN) for complex spectral mapping, which amounts to a causal system for monaural speech enhancement. Our experimental results suggest that the proposed GCRN substantially outperforms an existing convolutional neural network (CNN) for complex spectral mapping in terms of both objective speech intelligibility and quality. Moreover, the proposed approach yields significantly higher short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ) scores than magnitude spectral mapping and complex ratio masking. We also find that complex spectral mapping with the proposed GCRN provides an effective phase estimate.
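The idea behind complex spectral mapping, that estimating the real and imaginary parts fixes magnitude and phase together, can be illustrated with a minimal sketch. It is not the GCRN and contains no learned model. It assumes a naive DFT of a single 8-sample frame as a stand-in for one STFT column, a fixed toy "noise" vector, and hypothetical helper names (`dft`, `to_real_imag`, `to_magnitude_phase`), none of which come from the paper.

```python
import cmath
import math


def dft(frame):
    """Naive DFT of one real-valued frame (stand-in for one STFT column)."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n)
    ]


def to_real_imag(spectrum):
    """Split a complex spectrum into real and imaginary parts.

    In complex spectral mapping these two parts of the noisy spectrum are the
    input, and the same two parts of the clean spectrum are the target.
    """
    return [c.real for c in spectrum], [c.imag for c in spectrum]


def to_magnitude_phase(real, imag):
    """Recover magnitude and phase from (estimated) real and imaginary parts."""
    mags = [math.hypot(r, i) for r, i in zip(real, imag)]
    phases = [math.atan2(i, r) for r, i in zip(real, imag)]
    return mags, phases


# Toy data (assumption): one sinusoid cycle with a 0.5 rad phase offset.
N = 8
clean = [math.cos(2 * math.pi * t / N + 0.5) for t in range(N)]
noise = [0.3, -0.2, 0.1, 0.0, -0.1, 0.2, -0.3, 0.1]  # fixed illustrative noise
noisy = [c + n for c, n in zip(clean, noise)]

noisy_real, noisy_imag = to_real_imag(dft(noisy))  # what a model would see
clean_real, clean_imag = to_real_imag(dft(clean))  # what a model would regress to

# An ideal estimator returns the clean real/imag parts; magnitude and phase
# then follow jointly, with no separate phase estimator.
est_mags, est_phases = to_magnitude_phase(clean_real, clean_imag)
noisy_mags, noisy_phases = to_magnitude_phase(noisy_real, noisy_imag)
```

In this toy case, bin 1 of the clean spectrum carries the sinusoid, so its recovered phase matches the 0.5 rad offset, while the noisy phase at the same bin is perturbed. A magnitude-only method would reuse that perturbed noisy phase, which is the limitation that complex spectral mapping is meant to address.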