Long short-term memory for speaker generalization in supervised speech separation.

Authors

Chen Jitong, Wang DeLiang

Affiliation

Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio 43210, USA.

Publication

J Acoust Soc Am. 2017 Jun;141(6):4705. doi: 10.1121/1.4986931.

Abstract

Speech separation can be formulated as learning to estimate a time-frequency mask from acoustic features extracted from noisy speech. For supervised speech separation, generalization to unseen noises and unseen speakers is a critical issue. Although deep neural networks (DNNs) have been successful in noise-independent speech separation, DNNs are limited in modeling a large number of speakers. To improve speaker generalization, a separation model based on long short-term memory (LSTM) is proposed, which naturally accounts for temporal dynamics of speech. Systematic evaluation shows that the proposed model substantially outperforms a DNN-based model on unseen speakers and unseen noises in terms of objective speech intelligibility. Analyzing LSTM internal representations reveals that LSTM captures long-term speech contexts. It is also found that the LSTM model is more advantageous for low-latency speech separation: even without future frames, it performs better than the DNN model with future frames. The proposed model represents an effective approach for speaker- and noise-independent speech separation.

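To make the formulation in the abstract concrete, the following is a minimal sketch of an LSTM-based time-frequency mask estimator in PyTorch. The feature dimension, hidden size, layer count, and the sigmoid-bounded mask target are illustrative assumptions for exposition, not the exact configuration reported in the paper.

```python
# Minimal sketch of LSTM-based time-frequency mask estimation (PyTorch).
# Dimensions and the mask target below are illustrative assumptions,
# not the paper's exact setup.
import torch
import torch.nn as nn

class LstmMaskEstimator(nn.Module):
    def __init__(self, feat_dim=64, hidden_dim=512, num_layers=2, mask_dim=64):
        super().__init__()
        # A unidirectional LSTM is causal: it uses no future frames,
        # matching the low-latency setting discussed in the abstract.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, mask_dim)

    def forward(self, features):
        # features: (batch, frames, feat_dim) acoustic features of noisy speech
        out, _ = self.lstm(features)
        # Sigmoid bounds the estimated mask to [0, 1], as for a ratio-style mask.
        return torch.sigmoid(self.proj(out))

if __name__ == "__main__":
    model = LstmMaskEstimator()
    noisy_feats = torch.randn(8, 200, 64)   # 8 utterances, 200 frames each
    target_mask = torch.rand(8, 200, 64)    # placeholder training target
    loss = nn.functional.mse_loss(model(noisy_feats), target_mask)
    loss.backward()
    print(loss.item())
```

Because the recurrence only looks backward in time, a model of this form can produce a mask frame by frame, which is the contrast the abstract draws with a DNN that relies on a window of future frames.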

Similar Articles

Issues in forensic voice.
J Voice. 2014 Mar;28(2):170-84. doi: 10.1016/j.jvoice.2013.06.011. Epub 2013 Oct 28.

Cited By

Attentive Training: A New Training Framework for Speech Enhancement.
IEEE/ACM Trans Audio Speech Lang Process. 2023;31:1360-1370. doi: 10.1109/taslp.2023.3260711. Epub 2023 Mar 23.

Speech extraction from vibration signals based on deep learning.
PLoS One. 2023 Oct 25;18(10):e0288847. doi: 10.1371/journal.pone.0288847. eCollection 2023.

References

Noise Perturbation for Supervised Speech Separation.
Speech Commun. 2016 Apr 1;78:1-10. doi: 10.1016/j.specom.2015.12.006.

On Training Targets for Supervised Speech Separation.
IEEE/ACM Trans Audio Speech Lang Process. 2014 Dec;22(12):1849-1858. doi: 10.1109/TASLP.2014.2352935.

Long short-term memory.
Neural Comput. 1997 Nov 15;9(8):1735-80. doi: 10.1162/neco.1997.9.8.1735.
