On Training Targets for Supervised Speech Separation.

Authors

Wang Yuxuan, Narayanan Arun, Wang DeLiang

Affiliations

Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210 USA.

Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210 USA.

Publication information

IEEE/ACM Trans Audio Speech Lang Process. 2014 Dec;22(12):1849-1858. doi: 10.1109/TASLP.2014.2352935.

Abstract

Formulation of speech separation as a supervised learning problem has shown considerable promise. In its simplest form, a supervised learning algorithm, typically a deep neural network, is trained to learn a mapping from noisy features to a time-frequency representation of the target of interest. Traditionally, the ideal binary mask (IBM) is used as the target because of its simplicity and large speech intelligibility gains. The supervised learning framework, however, is not restricted to the use of binary targets. In this study, we evaluate and compare separation results by using different training targets, including the IBM, the target binary mask, the ideal ratio mask (IRM), the short-time Fourier transform spectral magnitude and its corresponding mask (FFT-MASK), and the Gammatone frequency power spectrum. Our results in various test conditions reveal that the two ratio mask targets, the IRM and the FFT-MASK, outperform the other targets in terms of objective intelligibility and quality metrics. In addition, we find that masking based targets, in general, are significantly better than spectral envelope based targets. We also present comparisons with recent methods in non-negative matrix factorization and speech enhancement, which show clear performance advantages of supervised speech separation.
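The abstract names the mask targets without defining them. As a minimal sketch, assuming the commonly used definitions (IBM: 1 where the local SNR exceeds a local criterion, else 0; IRM: speech-to-mixture power ratio raised to an exponent, here 0.5), two of the targets could be computed from premixed speech and noise power as follows. The function names, the `lc_db` and `beta` parameters, and the toy values are illustrative assumptions, not taken from the paper.

```python
import math

EPS = 1e-12  # guards against division by zero and log of zero


def ideal_binary_mask(speech_power, noise_power, lc_db=0.0):
    """Return 1 for T-F units whose local SNR (dB) exceeds lc_db, else 0.

    lc_db is the local criterion; 0 dB is an assumed default.
    """
    return [
        [1 if 10.0 * math.log10((s + EPS) / (n + EPS)) > lc_db else 0
         for s, n in zip(s_row, n_row)]
        for s_row, n_row in zip(speech_power, noise_power)
    ]


def ideal_ratio_mask(speech_power, noise_power, beta=0.5):
    """Return (S / (S + N)) ** beta per T-F unit, a soft mask in [0, 1].

    beta = 0.5 is an assumed default exponent.
    """
    return [
        [(s / (s + n + EPS)) ** beta for s, n in zip(s_row, n_row)]
        for s_row, n_row in zip(speech_power, noise_power)
    ]


# Toy 2x2 time-frequency grids of speech and noise power (made-up values).
speech = [[4.0, 1.0], [0.0, 9.0]]
noise = [[1.0, 1.0], [1.0, 1.0]]

ibm = ideal_binary_mask(speech, noise)
irm = ideal_ratio_mask(speech, noise)
print(ibm)
print(irm)
```

In this sketch the binary mask makes a hard keep-or-discard decision per time-frequency unit, while the ratio mask attenuates each unit in proportion to its estimated speech share. In a supervised setup, either grid would serve as the training label for a network fed with noisy features.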

Similar articles

1
On Training Targets for Supervised Speech Separation.
IEEE/ACM Trans Audio Speech Lang Process. 2014 Dec;22(12):1849-1858. doi: 10.1109/TASLP.2014.2352935.
3
Time-Frequency Masking in the Complex Domain for Speech Dereverberation and Denoising.
IEEE/ACM Trans Audio Speech Lang Process. 2017 Jul;25(7):1492-1501. doi: 10.1109/TASLP.2017.2696307. Epub 2017 Apr 20.
4
Complex Ratio Masking for Monaural Speech Separation.
IEEE/ACM Trans Audio Speech Lang Process. 2016 Mar;24(3):483-492. doi: 10.1109/TASLP.2015.2512042. Epub 2015 Dec 23.
5
Supervised Speech Separation Based on Deep Learning: An Overview.
IEEE/ACM Trans Audio Speech Lang Process. 2018 Oct;26(10):1702-1726. doi: 10.1109/TASLP.2018.2842159. Epub 2018 May 30.
6
Noise Perturbation for Supervised Speech Separation.
Speech Commun. 2016 Apr 1;78:1-10. doi: 10.1016/j.specom.2015.12.006.

Cited by

1
CROSS-DOMAIN DIFFUSION BASED SPEECH ENHANCEMENT FOR VERY NOISY SPEECH.
Proc IEEE Int Conf Acoust Speech Signal Process. 2023 Jun;2023. doi: 10.1109/icassp49357.2023.10096985. Epub 2023 May 5.
2
CROSS-DOMAIN SPEECH ENHANCEMENT WITH A NEURAL CASCADE ARCHITECTURE.
Proc IEEE Int Conf Acoust Speech Signal Process. 2022 May;2022:7862-7866. doi: 10.1109/icassp43922.2022.9747752. Epub 2022 Apr 27.

References

3
A classification based approach to speech segregation.
J Acoust Soc Am. 2012 Nov;132(5):3475-83. doi: 10.1121/1.4754541.
