Williamson Donald S, Wang Yuxuan, Wang DeLiang
Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio 43210, USA.
Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, Ohio 43210, USA.
J Acoust Soc Am. 2015 Sep;138(3):1399-407. doi: 10.1121/1.4928612.
As a means of speech separation, time-frequency masking applies a gain function to the time-frequency representation of noisy speech. On the other hand, nonnegative matrix factorization (NMF) addresses separation by linearly combining basis vectors from speech and noise models to approximate noisy speech. This paper presents an approach for improving the perceptual quality of speech separated from background noise at low signal-to-noise ratios. An ideal ratio mask is estimated, which separates speech from noise with reasonable sound quality. A deep neural network then approximates clean speech by estimating activation weights from the ratio-masked speech, where the weights linearly combine elements from an NMF speech model. Systematic comparisons using objective metrics, including the perceptual evaluation of speech quality, show that the proposed algorithm achieves higher speech quality than related masking and NMF methods. In addition, a listening test was performed and its results show that the output of the proposed algorithm is preferred over the comparison systems in terms of speech quality.
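The two stages described in the abstract, ratio masking followed by reconstruction from an NMF speech model, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. Several things here are assumptions made for illustration: the square-root energy-ratio form of the mask, the synthetic random "spectrograms", the additive magnitude model, and all function and variable names. In particular, the paper estimates the mask and uses a deep neural network to estimate the activation weights; this sketch computes the mask from known speech and noise and swaps in standard multiplicative updates with a fixed basis matrix as a stand-in for the network.

```python
import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag, eps=1e-12):
    """Per-unit gain in [0, 1]: square root of speech energy over total energy.

    Assumed definition for illustration; the abstract does not spell out the formula.
    """
    s2 = speech_mag ** 2
    n2 = noise_mag ** 2
    return np.sqrt(s2 / (s2 + n2 + eps))

def nmf_activations(V, W, n_iter=200, eps=1e-12):
    """Estimate nonnegative H such that V ~= W @ H, with the basis W held fixed.

    Uses Euclidean multiplicative updates as a stand-in for the
    deep neural network that estimates activations in the paper.
    """
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H

# Synthetic magnitude "spectrograms" (frequency bins x frames), purely hypothetical data.
rng = np.random.default_rng(1)
n_freq, n_frames, n_bases = 64, 20, 8
W_speech = rng.random((n_freq, n_bases))              # hypothetical NMF speech basis
speech = W_speech @ rng.random((n_bases, n_frames))   # clean speech magnitudes
noise = 0.5 * rng.random((n_freq, n_frames))          # noise magnitudes
noisy = speech + noise                                # additive magnitudes, an assumption

# Stage 1: time-frequency masking applies a gain to the noisy representation.
mask = ideal_ratio_mask(speech, noise)
masked = mask * noisy

# Stage 2: approximate clean speech as a linear combination of speech basis vectors.
H = nmf_activations(masked, W_speech)
reconstructed = W_speech @ H
```

Because the reconstruction is constrained to the span of the speech basis, residual noise that the mask lets through cannot be represented as freely as speech-like structure, which is the intuition behind combining the two methods.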