

Multisensory Fusion for Unsupervised Spatiotemporal Speaker Diarization.

Authors

Xylogiannis Paris, Vryzas Nikolaos, Vrysis Lazaros, Dimoulas Charalampos

Affiliation

Multidisciplinary Media & Mediated Communication Research Group (M3C), Aristotle University, 54636 Thessaloniki, Greece.

Publication

Sensors (Basel). 2024 Jun 29;24(13):4229. doi: 10.3390/s24134229.

Abstract

Speaker diarization consists of answering the question of "who spoke when" in audio recordings. In meeting scenarios, the task of labeling audio with the corresponding speaker identities can be further assisted by the exploitation of spatial features. This work proposes a framework designed to assess the effectiveness of combining speaker embeddings with Time Difference of Arrival (TDOA) values from available microphone sensor arrays in meetings. We extract speaker embeddings using two popular and robust pre-trained models, ECAPA-TDNN and X-vectors, and calculate the TDOA values via the Generalized Cross-Correlation (GCC) method with Phase Transform (PHAT) weighting. Although ECAPA-TDNN outperforms the X-vectors model, we utilize both speaker embedding models to explore the potential of employing a computationally lighter model when spatial information is exploited. Various techniques for combining the spatiotemporal information are examined in order to determine the best clustering method. The proposed framework is evaluated on two multichannel datasets: the AVLab Speaker Localization dataset and a multichannel dataset (SpeaD-M3C) enriched in the context of the present work with supplementary information from smartphone recordings. Our results strongly indicate that the integration of spatial information can significantly improve the performance of state-of-the-art deep learning diarization models, presenting a 2-3% reduction in diarization error rate (DER) compared to the baseline approach on the evaluated datasets.
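The abstract names GCC with PHAT weighting for TDOA estimation. A minimal NumPy sketch of GCC-PHAT between one microphone pair follows; the function name, sampling rate, and epsilon value are illustrative, not taken from the paper:

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the time delay of `sig` relative to `ref` (seconds)
    via Generalized Cross-Correlation with PHAT weighting."""
    n = sig.shape[0] + ref.shape[0]
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-15          # PHAT: keep phase, discard magnitude
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:         # restrict search to physically plausible lags
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(fs)
```

Running this over every microphone pair in the array yields, per speech segment, a TDOA vector that serves as the spatial feature.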

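The framework clusters speech segments on combined speaker-embedding and TDOA features, and the paper examines several combination techniques. One simple, illustrative option is weighted concatenation; the function name, the `alpha` weight, and the normalization choices below are assumptions for the sketch, not the paper's exact method:

```python
import numpy as np

def fuse_features(embeddings, tdoas, alpha=0.5):
    """Fuse per-segment speaker embeddings with per-segment TDOA values
    by weighted concatenation; `alpha` balances the two modalities.

    embeddings: (n_segments, emb_dim) array
    tdoas:      (n_segments, n_mic_pairs) array of delays
    """
    # L2-normalize embeddings so cosine geometry dominates that block
    emb = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-12)
    # z-score TDOAs so both modalities live on comparable scales
    td = (tdoas - tdoas.mean(axis=0)) / (tdoas.std(axis=0) + 1e-12)
    return np.hstack([alpha * emb, (1.0 - alpha) * td])
```

The fused matrix can then be fed to any clustering backend (e.g. agglomerative clustering); the spatial columns help separate segments whose embeddings alone are ambiguous.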

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/12d5/11243859/6149167d2e5e/sensors-24-04229-g001.jpg
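The 2-3% improvement is reported in DER, which sums missed speech, false alarms, and speaker confusion over the total scored speech time. A frame-level sketch of the metric (a full scorer such as NIST md-eval additionally solves the optimal speaker mapping and applies a forgiveness collar, which this toy version omits):

```python
import numpy as np

def frame_der(ref, hyp):
    """Frame-level diarization error rate.
    ref, hyp: one integer label per frame; 0 = silence, >0 = speaker ID.
    Assumes hypothesis speaker IDs are already mapped to reference IDs."""
    ref = np.asarray(ref)
    hyp = np.asarray(hyp)
    speech = ref > 0
    missed = np.sum(speech & (hyp == 0))                    # speech scored as silence
    false_alarm = np.sum(~speech & (hyp > 0))               # silence scored as speech
    confusion = np.sum(speech & (hyp > 0) & (hyp != ref))   # wrong speaker on speech
    return (missed + false_alarm + confusion) / np.sum(speech)
```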

Similar Articles

1
Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model.
Sensors (Basel). 2019 Nov 25;19(23):5163. doi: 10.3390/s19235163.
2
Meta-learning with Latent Space Clustering in Generative Adversarial Network for Speaker Diarization.
IEEE/ACM Trans Audio Speech Lang Process. 2021;29:1204-1219. doi: 10.1109/taslp.2021.3061885. Epub 2021 Feb 26.
3
Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion.
IEEE Trans Pattern Anal Mach Intell. 2018 May;40(5):1086-1099. doi: 10.1109/TPAMI.2017.2648793. Epub 2017 Jan 5.
4
Speaker-turn aware diarization for speech-based cognitive assessments.
Front Neurosci. 2024 Jan 16;17:1351848. doi: 10.3389/fnins.2023.1351848. eCollection 2023.
5
Semi-supervised audio-driven TV-news speaker diarization using deep neural embeddings.
J Acoust Soc Am. 2020 Dec;148(6):3751. doi: 10.1121/10.0002924.
6
End-to-end neural speaker diarization with an iterative adaptive attractor estimation.
Neural Netw. 2023 Sep;166:566-578. doi: 10.1016/j.neunet.2023.07.043. Epub 2023 Aug 1.
7
Multimodal Speaker Diarization.
IEEE Trans Pattern Anal Mach Intell. 2012 Jan;34(1):79-93. doi: 10.1109/TPAMI.2011.47. Epub 2011 Mar 10.
8
Ensemble learning with speaker embeddings in multiple speech task stimuli for depression detection.
Front Neurosci. 2023 Mar 23;17:1141621. doi: 10.3389/fnins.2023.1141621. eCollection 2023.

