Online Binaural Speech Separation of Moving Speakers with a Wavesplit Network

Authors

Han Cong, Mesgarani Nima

Affiliation

Department of Electrical Engineering, Columbia University, New York, NY.

Publication

Proc IEEE Int Conf Acoust Speech Signal Process. 2023 Jun;2023. doi: 10.1109/icassp49357.2023.10095695. Epub 2023 May 5.

DOI:10.1109/icassp49357.2023.10095695

PMID:37577180

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10417534/

Abstract

Binaural speech separation in real-world scenarios often involves moving speakers. Most current speech separation methods are trained with utterance-level permutation invariant training (u-PIT). At inference time, however, the order of the outputs can be inconsistent over time, particularly in long-form speech separation. This situation, referred to as the speaker swap problem, is even more problematic when speakers constantly move in space, and it therefore poses a challenge for the consistent placement of speakers in output channels. Here, we describe a real-time binaural speech separation model based on a Wavesplit network that mitigates the speaker swap problem for moving-speaker separation. Our model computes a speaker embedding for each speaker at each time frame from the mixed audio, aggregates the embeddings using online clustering, and uses the cluster centroids as speaker profiles to track each speaker over long durations. Experimental results on reverberant, long-form, moving multitalker speech separation show that the proposed method is less prone to speaker swap and performs comparably to u-PIT-based models with ground-truth tracking, both in separation accuracy and in preserving the interaural cues.
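
The tracking idea in the abstract (per-frame speaker embeddings, online clustering, centroids used as speaker profiles) can be illustrated with a small sketch. The abstract does not specify the clustering algorithm, the embedding dimension, or the distance measure, so everything below is an assumption made for illustration: the class name `OnlineSpeakerTracker`, the running-mean centroid update, the squared Euclidean distance, and the toy 2-D embeddings are all hypothetical and are not taken from the paper.

```python
from itertools import permutations


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class OnlineSpeakerTracker:
    """Running-mean centroids, one per output channel (hypothetical helper)."""

    def __init__(self, num_speakers):
        self.num_speakers = num_speakers
        self.centroids = None  # initialised from the first frame
        self.counts = [0] * num_speakers

    def step(self, frame_embeddings):
        """Assign one frame's embeddings to channels and update centroids.

        Returns `order`, where order[c] is the index of the frame embedding
        assigned to output channel c.
        """
        if self.centroids is None:
            # First frame: accept the embeddings in the order they arrive.
            self.centroids = [list(e) for e in frame_embeddings]
            self.counts = [1] * self.num_speakers
            return tuple(range(self.num_speakers))

        # Choose the permutation with the smallest total distance to the
        # current centroids (exhaustive search; fine for 2-3 speakers).
        order = min(
            permutations(range(self.num_speakers)),
            key=lambda p: sum(
                sq_dist(self.centroids[c], frame_embeddings[p[c]])
                for c in range(self.num_speakers)
            ),
        )
        # Running-mean update of each centroid with its assigned embedding.
        for c in range(self.num_speakers):
            self.counts[c] += 1
            emb = frame_embeddings[order[c]]
            self.centroids[c] = [
                m + (x - m) / self.counts[c]
                for m, x in zip(self.centroids[c], emb)
            ]
        return order


# Toy example: in the second frame the two embeddings arrive swapped.
tracker = OnlineSpeakerTracker(num_speakers=2)
frames = [
    [(1.0, 0.0), (0.0, 1.0)],
    [(0.1, 0.9), (0.9, 0.1)],
    [(0.95, 0.05), (0.0, 1.0)],
]
orders = [tracker.step(f) for f in frames]
print(orders)
```

In this toy run the tracker returns a swapped assignment for the second frame only, so each output channel keeps following the same centroid. A real system would likely need a forgetting factor or a more robust clustering rule, since a plain running mean adapts slowly when embeddings drift over long recordings.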
