
Multisensory Fusion for Unsupervised Spatiotemporal Speaker Diarization.

Authors

Xylogiannis Paris, Vryzas Nikolaos, Vrysis Lazaros, Dimoulas Charalampos

Affiliation

Multidisciplinary Media & Mediated Communication Research Group (M3C), Aristotle University, 54636 Thessaloniki, Greece.

Publication

Sensors (Basel). 2024 Jun 29;24(13):4229. doi: 10.3390/s24134229.

DOI: 10.3390/s24134229
PMID: 39001008
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11243859/
Abstract

Speaker diarization consists of answering the question of "who spoke when" in audio recordings. In meeting scenarios, the task of labeling audio with the corresponding speaker identities can be further assisted by the exploitation of spatial features. This work proposes a framework designed to assess the effectiveness of combining speaker embeddings with Time Difference of Arrival (TDOA) values from available microphone sensor arrays in meetings. We extract speaker embeddings using two popular and robust pre-trained models, ECAPA-TDNN and X-vectors, and calculate the TDOA values via the Generalized Cross-Correlation (GCC) method with Phase Transform (PHAT) weighting. Although ECAPA-TDNN outperforms the X-vectors model, we utilize both speaker embedding models to explore the potential of employing a computationally lighter model when spatial information is exploited. Various techniques for combining the spatial-temporal information are examined in order to determine the best clustering method. The proposed framework is evaluated on two multichannel datasets: the AVLab Speaker Localization dataset and a multichannel dataset (SpeaD-M3C) enriched in the context of the present work with supplementary information from smartphone recordings. Our results strongly indicate that the integration of spatial information can significantly improve the performance of state-of-the-art deep learning diarization models, presenting a 2-3% reduction in DER compared to the baseline approach on the evaluated datasets.
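The abstract computes TDOA values with GCC-PHAT: cross-power spectrum of two channels, magnitude discarded so only phase survives, inverse FFT, and a peak search. A minimal NumPy sketch of that estimator (the function name and interface are illustrative, not the authors' implementation):

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the TDOA between two channels via Generalized
    Cross-Correlation with PHAT weighting. Illustrative helper,
    not the paper's code."""
    n = len(sig) + len(ref)                # zero-pad to avoid circular wrap
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12                 # PHAT: keep phase, drop magnitude
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # reorder so lag 0 sits in the middle of the search window
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs                      # TDOA in seconds
```

With a synthetic 20-sample delay at 16 kHz, the estimator recovers 20/16000 = 1.25 ms; in a meeting, the same value computed per microphone pair becomes a spatial feature for each speech segment.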

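The abstract's fusion step combines speaker embeddings with TDOA values before clustering. The paper examines several combination techniques; one simple baseline is to concatenate L2-normalized embeddings with scaled TDOA features and cluster the result. The sketch below uses that scheme with a tiny k-means; the `weight` parameter, deterministic initialization, and all names are illustrative assumptions, not the authors' chosen method:

```python
import numpy as np

def fuse_and_cluster(embeddings, tdoas, weight, k, iters=50):
    """Concatenate unit-norm speaker embeddings with standardized,
    weighted TDOA features, then run a small k-means. Illustrative
    fusion baseline, not the paper's selected technique."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    spat = weight * (tdoas - tdoas.mean(0)) / (tdoas.std(0) + 1e-9)
    feats = np.hstack([emb, spat])
    # deterministic init for the sketch: k points spread over the segments
    centers = feats[np.linspace(0, len(feats) - 1, k).astype(int)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(feats[:, None] - centers[None], axis=2)
        labels = dists.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = feats[labels == j].mean(0)
    return labels
```

Here `weight` plays the role the abstract implies for balancing spatial against spectral cues: with well-separated TDOAs, even a lightweight embedding model can yield clean clusters, which is the motivation the authors give for keeping X-vectors in the comparison.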

Figures

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/12d5/11243859/6149167d2e5e/sensors-24-04229-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/12d5/11243859/53195c821017/sensors-24-04229-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/12d5/11243859/c28ce34244b2/sensors-24-04229-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/12d5/11243859/444b5bd6a1fd/sensors-24-04229-g004.jpg

Similar articles

1
Multisensory Fusion for Unsupervised Spatiotemporal Speaker Diarization.
Sensors (Basel). 2024 Jun 29;24(13):4229. doi: 10.3390/s24134229.
2
Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model.
Sensors (Basel). 2019 Nov 25;19(23):5163. doi: 10.3390/s19235163.
3
Meta-learning with Latent Space Clustering in Generative Adversarial Network for Speaker Diarization.
IEEE/ACM Trans Audio Speech Lang Process. 2021;29:1204-1219. doi: 10.1109/taslp.2021.3061885. Epub 2021 Feb 26.
4
Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion.
IEEE Trans Pattern Anal Mach Intell. 2018 May;40(5):1086-1099. doi: 10.1109/TPAMI.2017.2648793. Epub 2017 Jan 5.
5
Speaker-turn aware diarization for speech-based cognitive assessments.
Front Neurosci. 2024 Jan 16;17:1351848. doi: 10.3389/fnins.2023.1351848. eCollection 2023.
6
Semi-supervised audio-driven TV-news speaker diarization using deep neural embeddings.
J Acoust Soc Am. 2020 Dec;148(6):3751. doi: 10.1121/10.0002924.
7
Development of Supervised Speaker Diarization System Based on the PyAnnote Audio Processing Library.
Sensors (Basel). 2023 Feb 13;23(4):2082. doi: 10.3390/s23042082.
8
End-to-end neural speaker diarization with an iterative adaptive attractor estimation.
Neural Netw. 2023 Sep;166:566-578. doi: 10.1016/j.neunet.2023.07.043. Epub 2023 Aug 1.
9
Multimodal Speaker Diarization.
IEEE Trans Pattern Anal Mach Intell. 2012 Jan;34(1):79-93. doi: 10.1109/TPAMI.2011.47. Epub 2011 Mar 10.
10
Ensemble learning with speaker embeddings in multiple speech task stimuli for depression detection.
Front Neurosci. 2023 Mar 23;17:1141621. doi: 10.3389/fnins.2023.1141621. eCollection 2023.
