


Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation.

Authors

Yi Luo, Nima Mesgarani

Publication

IEEE/ACM Trans Audio Speech Lang Process. 2019 Aug;27(8):1256-1266. doi: 10.1109/TASLP.2019.2915167. Epub 2019 May 6.

DOI: 10.1109/TASLP.2019.2915167
PMID: 31485462
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6726126/
Abstract

Single-channel, speaker-independent speech separation methods have recently seen great progress. However, the accuracy, latency, and computational cost of such methods remain insufficient. The majority of the previous methods have formulated the separation problem through the time-frequency representation of the mixed signal, which has several drawbacks, including the decoupling of the phase and magnitude of the signal, the suboptimality of time-frequency representation for speech separation, and the long latency of the entire system. To address these shortcomings, we propose a fully-convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation. Conv-TasNet uses a linear encoder to generate a representation of the speech waveform optimized for separating individual speakers. Speaker separation is achieved by applying a set of weighting functions (masks) to the encoder output. The modified encoder representations are then inverted back to the waveforms using a linear decoder. The masks are found using a temporal convolutional network (TCN) consisting of stacked 1-D dilated convolutional blocks, which allows the network to model the long-term dependencies of the speech signal while maintaining a small model size. The proposed Conv-TasNet system significantly outperforms previous time-frequency masking methods in separating two- and three-speaker mixtures. Additionally, Conv-TasNet surpasses several ideal time-frequency magnitude masks in two-speaker speech separation as evaluated by both objective distortion measures and subjective quality assessment by human listeners. Finally, Conv-TasNet has a significantly smaller model size and a much shorter minimum latency, making it a suitable solution for both offline and real-time speech separation applications. This study therefore represents a major step toward the realization of speech separation systems for real-world speech processing technologies.
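The encoder → mask → decoder pipeline described in the abstract can be sketched in a few lines of NumPy. This is a toy illustration, not the trained model: the encoder and decoder use random weights, and the TCN mask estimator is stood in by random masks soft-maxed across speakers; only the data flow (overlapping frames, a learned linear basis, per-speaker masking of the latent representation, and overlap-add reconstruction) follows the paper.

```python
import numpy as np

def conv_tasnet_sketch(mixture, n_filters=16, win=8, n_speakers=2, seed=0):
    """Toy sketch of the Conv-TasNet data flow with random (untrained) weights."""
    rng = np.random.default_rng(seed)
    hop = win // 2  # 50% frame overlap, as in the paper's encoder stride
    n_frames = (len(mixture) - win) // hop + 1
    # Frame the waveform into overlapping windows of length `win`.
    frames = np.stack([mixture[i * hop : i * hop + win] for i in range(n_frames)])
    # Linear encoder: each frame -> n_filters-dimensional latent representation.
    U = rng.standard_normal((win, n_filters))
    rep = frames @ U                                   # (n_frames, n_filters)
    # The TCN of stacked dilated 1-D conv blocks would estimate masks here;
    # we substitute random masks, soft-maxed so they sum to 1 across speakers.
    logits = rng.standard_normal((n_speakers, n_frames, n_filters))
    masks = np.exp(logits) / np.exp(logits).sum(axis=0)
    # Apply each mask and invert with a linear decoder via overlap-add.
    V = rng.standard_normal((n_filters, win))
    sources = np.zeros((n_speakers, len(mixture)))
    for s in range(n_speakers):
        recon = (masks[s] * rep) @ V                   # (n_frames, win)
        for i in range(n_frames):
            sources[s, i * hop : i * hop + win] += recon[i]
    return sources

mix = np.sin(0.1 * np.arange(64)) + np.sin(0.3 * np.arange(64))
est = conv_tasnet_sketch(mix)
print(est.shape)  # → (2, 64)
```

Because all operations after the encoder are linear in the masked representation, the whole system stays end-to-end differentiable in the time domain, which is what lets the real model train the encoder basis jointly with the separator instead of fixing an STFT front end.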


Similar Articles

1. Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation.
   IEEE/ACM Trans Audio Speech Lang Process. 2019 Aug;27(8):1256-1266. doi: 10.1109/TASLP.2019.2915167. Epub 2019 May 6.
2. A dual-stream deep attractor network with multi-domain learning for speech dereverberation and separation.
   Neural Netw. 2021 Sep;141:238-248. doi: 10.1016/j.neunet.2021.04.023. Epub 2021 Apr 21.
3. Single-Channel Blind Source Separation of Spatial Aliasing Signal Based on Stacked-LSTM.
   Sensors (Basel). 2021 Jul 16;21(14):4844. doi: 10.3390/s21144844.
4. Speaker separation in realistic noise environments with applications to a cognitively-controlled hearing aid.
   Neural Netw. 2021 Aug;140:136-147. doi: 10.1016/j.neunet.2021.02.020. Epub 2021 Mar 4.
5. Privacy-Preserving Deep Speaker Separation for Smartphone-Based Passive Speech Assessment.
   IEEE Open J Eng Med Biol. 2021 Mar 4;2:304-313. doi: 10.1109/OJEMB.2021.3063994. eCollection 2021.
6. Attention-Based Joint Training of Noise Suppression and Sound Event Detection for Noise-Robust Classification.
   Sensors (Basel). 2021 Oct 9;21(20):6718. doi: 10.3390/s21206718.
7. NeoSSNet: Real-Time Neonatal Chest Sound Separation Using Deep Learning.
   IEEE Open J Eng Med Biol. 2024 May 15;5:345-352. doi: 10.1109/OJEMB.2024.3401571. eCollection 2024.
8. Noise-robust voice conversion with domain adversarial training.
   Neural Netw. 2022 Apr;148:74-84. doi: 10.1016/j.neunet.2022.01.003. Epub 2022 Jan 13.
9. Deep Learning for Talker-dependent Reverberant Speaker Separation: An Empirical Study.
   IEEE/ACM Trans Audio Speech Lang Process. 2019 Nov;27(11):1839-1848. doi: 10.1109/taslp.2019.2934319. Epub 2019 Aug 12.
10. Online Binaural Speech Separation of Moving Speakers With a Wavesplit Network.
    Proc IEEE Int Conf Acoust Speech Signal Process. 2023 Jun;2023. doi: 10.1109/icassp49357.2023.10095695. Epub 2023 May 5.

Cited By

1. Deflationary Extraction Transformer for Speech Separation with Unknown Number of Talkers.
   Sensors (Basel). 2025 Aug 8;25(16):4905. doi: 10.3390/s25164905.
2. Evaluation of Speaker-Conditioned Target Speaker Extraction Algorithms for Hearing-Impaired Listeners.
   Trends Hear. 2025 Jan-Dec;29:23312165251365802. doi: 10.1177/23312165251365802. Epub 2025 Aug 11.
3. Improving Label Assignments Learning by Dynamic Sample Dropout Combined with Layer-wise Optimization in Speech Separation.
   Interspeech. 2023 Aug;2023:3492-3496. doi: 10.21437/interspeech.2023-1172.
4. Audio-visual source separation with localization and individual control.
   PLoS One. 2025 May 23;20(5):e0321856. doi: 10.1371/journal.pone.0321856. eCollection 2025.
5. Three-stage hybrid spiking neural networks fine-tuning for speech enhancement.
   Front Neurosci. 2025 Apr 30;19:1567347. doi: 10.3389/fnins.2025.1567347. eCollection 2025.
6. Effective Acoustic Model-Based Beamforming Training for Static and Dynamic Hri Applications.
   Sensors (Basel). 2024 Oct 15;24(20):6644. doi: 10.3390/s24206644.
7. Semiautomated generation of species-specific training data from large, unlabeled acoustic datasets for deep supervised birdsong isolation.
   PeerJ. 2024 Sep 23;12:e17854. doi: 10.7717/peerj.17854. eCollection 2024.
8. Brain-Controlled Augmented Hearing for Spatially Moving Conversations in Multi-Talker Environments.
   Adv Sci (Weinh). 2024 Nov;11(41):e2401379. doi: 10.1002/advs.202401379. Epub 2024 Sep 9.
9. Analysis and interpretation of joint source separation and sound event detection in domestic environments.
   PLoS One. 2024 Jul 5;19(7):e0303994. doi: 10.1371/journal.pone.0303994. eCollection 2024.
10. NeoSSNet: Real-Time Neonatal Chest Sound Separation Using Deep Learning.
    IEEE Open J Eng Med Biol. 2024 May 15;5:345-352. doi: 10.1109/OJEMB.2024.3401571. eCollection 2024.

References

1. Supervised Speech Separation Based on Deep Learning: An Overview.
   IEEE/ACM Trans Audio Speech Lang Process. 2018 Oct;26(10):1702-1726. doi: 10.1109/TASLP.2018.2842159. Epub 2018 May 30.
2. Deep Attractor Network for Single-Microphone Speaker Separation.
   Proc IEEE Int Conf Acoust Speech Signal Process. 2017 Mar;2017:246-250. doi: 10.1109/ICASSP.2017.7952155. Epub 2017 Jun 19.
3. Deep Clustering and Conventional Networks for Music Separation: Stronger Together.
   Proc IEEE Int Conf Acoust Speech Signal Process. 2017 Mar;2017:61-65. doi: 10.1109/ICASSP.2017.7952118. Epub 2017 Jun 19.
4. On Training Targets for Supervised Speech Separation.
   IEEE/ACM Trans Audio Speech Lang Process. 2014 Dec;22(12):1849-1858. doi: 10.1109/TASLP.2014.2352935.
5. Nonnegative least-correlated component analysis for separation of dependent sources by volume maximization.
   IEEE Trans Pattern Anal Mach Intell. 2010 May;32(5):875-88. doi: 10.1109/TPAMI.2009.72.
6. Convex and semi-nonnegative matrix factorizations.
   IEEE Trans Pattern Anal Mach Intell. 2010 Jan;32(1):45-55. doi: 10.1109/TPAMI.2008.277.
7. Effects of fundamental frequency and vocal-tract length changes on attention to one of two simultaneous talkers.
   J Acoust Soc Am. 2003 Nov;114(5):2913-22. doi: 10.1121/1.1616924.
8. Blind source separation by sparse decomposition in a signal dictionary.
   Neural Comput. 2001 Apr;13(4):863-82. doi: 10.1162/089976601300014385.
9. Tonotopic organization of the human auditory cortex.
   Science. 1982 Jun 18;216(4552):1339-40. doi: 10.1126/science.7079770.
10. Tonotopic organization of the auditory cortex: pitch versus frequency representation.
    Science. 1989 Oct 27;246(4929):486-8. doi: 10.1126/science.2814476.