


Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion.

Publication

IEEE Trans Pattern Anal Mach Intell. 2018 May;40(5):1086-1099. doi: 10.1109/TPAMI.2017.2648793. Epub 2017 Jan 5.

DOI: 10.1109/TPAMI.2017.2648793
PMID: 28103192
Abstract

Speaker diarization consists of assigning speech signals to people engaged in a dialogue. An audio-visual spatiotemporal diarization model is proposed. The model is well suited for challenging scenarios that consist of several participants engaged in multi-party interaction while they move around and turn their heads towards the other participants rather than facing the cameras and the microphones. Multiple-person visual tracking is combined with multiple speech-source localization in order to tackle the speech-to-person association problem. The latter is solved within a novel audio-visual fusion method on the following grounds: binaural spectral features are first extracted from a microphone pair, then a supervised audio-visual alignment technique maps these features onto an image, and finally a semi-supervised clustering method assigns binaural spectral features to visible persons. The main advantage of this method over previous work is that it processes in a principled way speech signals uttered simultaneously by multiple persons. The diarization itself is cast into a latent-variable temporal graphical model that infers speaker identities and speech turns, based on the output of an audio-visual association process, executed at each time slice, and on the dynamics of the diarization variable itself. The proposed formulation yields an efficient exact inference procedure. A novel dataset, that contains audio-visual training data as well as a number of scenarios involving several participants engaged in formal and informal dialogue, is introduced. The proposed method is thoroughly tested and benchmarked with respect to several state-of-the-art diarization algorithms.
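The "latent-variable temporal graphical model" with "efficient exact inference" described above is, at its core, an HMM-style dynamic program over speaker states: per-frame audio-visual association scores play the role of emission likelihoods, and the diarization dynamics play the role of transitions. As a rough single-active-speaker illustration only (the paper's actual model handles simultaneous speakers and is not reproduced here; the function name and score matrices are hypothetical), a Viterbi decoding pass might look like:

```python
import numpy as np

def viterbi_diarization(assoc_loglik, log_trans, log_prior):
    """Most likely speaker sequence given per-frame association scores.

    assoc_loglik: (T, N) log p(audio-visual observation at t | speaker n)
    log_trans:    (N, N) log p(speaker_t = j | speaker_{t-1} = i)
    log_prior:    (N,)   log p(speaker_0 = n)
    Returns an integer array of length T with the MAP speaker per frame.
    """
    T, N = assoc_loglik.shape
    delta = log_prior + assoc_loglik[0]          # best log-score ending in each state
    back = np.zeros((T, N), dtype=int)           # backpointers for path recovery
    for t in range(1, T):
        scores = delta[:, None] + log_trans      # (N, N): prev state x next state
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + assoc_loglik[t]
    path = np.empty(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):                # trace backpointers
        path[t - 1] = back[t, path[t]]
    return path
```

With "sticky" transitions (high self-transition probability), such a decoder smooths over frames where the audio-visual association is momentarily ambiguous, which is the intuition behind modeling diarization dynamics rather than classifying each time slice independently.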


Similar Articles

1. Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion. IEEE Trans Pattern Anal Mach Intell. 2018 May;40(5):1086-1099. doi: 10.1109/TPAMI.2017.2648793. Epub 2017 Jan 5.
2. Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model. Sensors (Basel). 2019 Nov 25;19(23):5163. doi: 10.3390/s19235163.
3. Development of Supervised Speaker Diarization System Based on the PyAnnote Audio Processing Library. Sensors (Basel). 2023 Feb 13;23(4):2082. doi: 10.3390/s23042082.
4. Multisensory Fusion for Unsupervised Spatiotemporal Speaker Diarization. Sensors (Basel). 2024 Jun 29;24(13):4229. doi: 10.3390/s24134229.
5. Speaker-turn aware diarization for speech-based cognitive assessments. Front Neurosci. 2024 Jan 16;17:1351848. doi: 10.3389/fnins.2023.1351848. eCollection 2023.
6. Supervised Speaker Diarization Using Random Forests: A Tool for Psychotherapy Process Research. Front Psychol. 2020 Jul 28;11:1726. doi: 10.3389/fpsyg.2020.01726. eCollection 2020.
7. Semi-supervised audio-driven TV-news speaker diarization using deep neural embeddings. J Acoust Soc Am. 2020 Dec;148(6):3751. doi: 10.1121/10.0002924.
8. Multimodal Speaker Diarization. IEEE Trans Pattern Anal Mach Intell. 2012 Jan;34(1):79-93. doi: 10.1109/TPAMI.2011.47. Epub 2011 Mar 10.
9. End-to-end neural speaker diarization with an iterative adaptive attractor estimation. Neural Netw. 2023 Sep;166:566-578. doi: 10.1016/j.neunet.2023.07.043. Epub 2023 Aug 1.
10. Evaluation of Deep Clustering for Diarization of Aphasic Speech. Stud Health Technol Inform. 2019;260:81-88.

Cited By

1. Multimodality Fusion Aspects of Medical Diagnosis: A Comprehensive Review. Bioengineering (Basel). 2024 Dec 5;11(12):1233. doi: 10.3390/bioengineering11121233.
2. Multisensory Fusion for Unsupervised Spatiotemporal Speaker Diarization. Sensors (Basel). 2024 Jun 29;24(13):4229. doi: 10.3390/s24134229.
3. Audiovisual Tracking of Multiple Speakers in Smart Spaces. Sensors (Basel). 2023 Aug 5;23(15):6969. doi: 10.3390/s23156969.
4. Speech Discrimination in Real-World Group Communication Using Audio-Motion Multimodal Sensing. Sensors (Basel). 2020 May 22;20(10):2948. doi: 10.3390/s20102948.
5. Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model. Sensors (Basel). 2019 Nov 25;19(23):5163. doi: 10.3390/s19235163.