Hao Fengyuan, Li Xiaodong, Zheng Chengshi
Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, 100190, China; University of Chinese Academy of Sciences, Beijing, 100049, China.
Neural Netw. 2023 Sep;166:566-578. doi: 10.1016/j.neunet.2023.07.043. Epub 2023 Aug 1.
End-to-end neural diarization (EEND), which can directly output speaker diarization results and handle overlapping speech, has attracted increasing attention due to its promising performance. Although existing EEND-based methods often outperform clustering-based methods, they generalize poorly to unseen test sets because fixed attractors are typically used to estimate the speech activity of each speaker. An iterative adaptive attractor estimation (IAAE) network was proposed to refine diarization results, in which self-attentive EEND (SA-EEND) was used to initialize the diarization results and frame-wise embeddings. The proposed IAAE network has two main parts: an attention-based pooling designed to obtain a rough estimate of the attractors from the diarization results of the previous iteration, and an adaptive attractor then computed by transformer decoder blocks. A unified training framework was proposed to further improve diarization performance, making the embeddings more discriminative with respect to the well-separated attractors. We evaluated the proposed method on both simulated mixtures and the real CALLHOME dataset using the diarization error rate (DER). At the second iteration step, the proposed method reduces DER relative to the baseline SA-EEND by up to 44.8% on simulated 2-speaker mixtures and 23.6% on the CALLHOME dataset. We also show that with more refinement steps, the DER on the CALLHOME dataset can be further reduced to 7.36%, achieving state-of-the-art diarization results compared with other methods.
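The refinement loop described in the abstract can be sketched as follows. This is a minimal PyTorch illustration of one IAAE-style iteration, not the authors' implementation: the class name, layer sizes, and the specific pooling (posterior-weighted averaging of frame embeddings, followed by a transformer decoder with the rough attractors as queries and the embeddings as memory) are assumptions based only on the abstract's description.

```python
import torch
import torch.nn as nn

class IAAESketch(nn.Module):
    """Hedged sketch of one IAAE refinement iteration (hypothetical names/sizes)."""

    def __init__(self, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, n_layers)

    def forward(self, emb, post):
        # emb:  (B, T, D) frame-wise embeddings, e.g. from SA-EEND
        # post: (B, T, S) per-speaker diarization posteriors from the previous iteration
        # 1) Attention-based pooling: posterior-weighted average of frame
        #    embeddings gives a rough attractor per speaker.
        w = post / post.sum(dim=1, keepdim=True).clamp(min=1e-8)  # (B, T, S)
        rough = torch.einsum('bts,btd->bsd', w, emb)              # (B, S, D)
        # 2) Adaptive attractors: rough attractors act as decoder queries
        #    attending over the frame embeddings.
        attractors = self.decoder(rough, emb)                     # (B, S, D)
        # 3) New posteriors: frame/attractor inner products through a sigmoid,
        #    to be fed back in as `post` for the next iteration.
        return torch.sigmoid(torch.einsum('btd,bsd->bts', emb, attractors))
```

In use, the module would be applied repeatedly, feeding each iteration's output posteriors back in, which matches the abstract's observation that additional refinement steps further reduce DER.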