


Multimodal Speaker Diarization.

Publication Information

IEEE Trans Pattern Anal Mach Intell. 2012 Jan;34(1):79-93. doi: 10.1109/TPAMI.2011.47. Epub 2011 Mar 10.

DOI: 10.1109/TPAMI.2011.47
PMID: 21383401
Abstract

We present a novel probabilistic framework that fuses information coming from the audio and video modality to perform speaker diarization. The proposed framework is a Dynamic Bayesian Network (DBN) that is an extension of a factorial Hidden Markov Model (fHMM) and models the people appearing in an audiovisual recording as multimodal entities that generate observations in the audio stream, the video stream, and the joint audiovisual space. The framework is very robust to different contexts, makes no assumptions about the location of the recording equipment, and does not require labeled training data as it acquires the model parameters using the Expectation Maximization (EM) algorithm. We apply the proposed model to two meeting videos and a news broadcast video, all of which come from publicly available data sets. The results acquired in speaker diarization are in favor of the proposed multimodal framework, which outperforms the single modality analysis results and improves over the state-of-the-art audio-based speaker diarization.
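The abstract describes learning the model's parameters without labeled data via Expectation Maximization. As a toy illustration only (not the paper's DBN/fHMM), the hypothetical sketch below runs one Baum-Welch (EM) update of a discrete two-speaker HMM whose observation symbols index a joint audio-visual space, mirroring the unsupervised parameter-acquisition step; all variable names and the toy observation sequence are assumptions for illustration.

```python
import numpy as np

def forward_backward(obs, pi, A, B):
    """Unscaled forward-backward pass; returns alpha, beta, and state posteriors gamma."""
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    return alpha, beta, gamma

def em_step(obs, pi, A, B):
    """One unsupervised EM (Baum-Welch) update of initial, transition, and emission probabilities."""
    T, K = len(obs), len(pi)
    alpha, beta, gamma = forward_backward(obs, pi, A, B)
    # E-step: expected transition counts xi[t, i, j]
    xi = np.zeros((T - 1, K, K))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        xi[t] /= xi[t].sum()
    # M-step: re-estimate parameters from the expected counts
    new_pi = gamma[0]
    new_A = xi.sum(axis=0)
    new_A /= new_A.sum(axis=1, keepdims=True)
    new_B = np.zeros_like(B)
    for t in range(T):
        new_B[:, obs[t]] += gamma[t]
    new_B /= new_B.sum(axis=1, keepdims=True)
    return new_pi, new_A, new_B, gamma

# Two "speakers" (hidden states), four joint audio-visual symbols
# (e.g. audio symbol x video symbol); the sequence is made up.
rng = np.random.default_rng(0)
obs = np.array([0, 0, 1, 3, 3, 2, 3, 0, 1, 0])
pi = np.full(2, 0.5)
A = np.array([[0.9, 0.1], [0.1, 0.9]])          # speakers tend to keep the floor
B = rng.dirichlet(np.ones(4), size=2)            # random initial emissions
pi, A, B, gamma = em_step(obs, pi, A, B)
diarization = gamma.argmax(axis=1)               # most likely speaker per frame
```

In the actual paper the hidden structure is factorial (one chain per person) and the observations are continuous audio, video, and joint audiovisual features, but the E-step/M-step alternation is the same idea.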


Similar Articles

1
Multimodal Speaker Diarization.
IEEE Trans Pattern Anal Mach Intell. 2012 Jan;34(1):79-93. doi: 10.1109/TPAMI.2011.47. Epub 2011 Mar 10.
2
Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model.
Sensors (Basel). 2019 Nov 25;19(23):5163. doi: 10.3390/s19235163.
3
Semi-supervised audio-driven TV-news speaker diarization using deep neural embeddings.
J Acoust Soc Am. 2020 Dec;148(6):3751. doi: 10.1121/10.0002924.
4
Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion.
IEEE Trans Pattern Anal Mach Intell. 2018 May;40(5):1086-1099. doi: 10.1109/TPAMI.2017.2648793. Epub 2017 Jan 5.
5
Multisensory Fusion for Unsupervised Spatiotemporal Speaker Diarization.
Sensors (Basel). 2024 Jun 29;24(13):4229. doi: 10.3390/s24134229.
6
Development of Supervised Speaker Diarization System Based on the PyAnnote Audio Processing Library.
Sensors (Basel). 2023 Feb 13;23(4):2082. doi: 10.3390/s23042082.
7
Supervised Speaker Diarization Using Random Forests: A Tool for Psychotherapy Process Research.
Front Psychol. 2020 Jul 28;11:1726. doi: 10.3389/fpsyg.2020.01726. eCollection 2020.
8
End-to-end neural speaker diarization with an iterative adaptive attractor estimation.
Neural Netw. 2023 Sep;166:566-578. doi: 10.1016/j.neunet.2023.07.043. Epub 2023 Aug 1.
9
Meta-learning with Latent Space Clustering in Generative Adversarial Network for Speaker Diarization.
IEEE/ACM Trans Audio Speech Lang Process. 2021;29:1204-1219. doi: 10.1109/taslp.2021.3061885. Epub 2021 Feb 26.
10
Speaker-turn aware diarization for speech-based cognitive assessments.
Front Neurosci. 2024 Jan 16;17:1351848. doi: 10.3389/fnins.2023.1351848. eCollection 2023.

Cited By

1
Speech Discrimination in Real-World Group Communication Using Audio-Motion Multimodal Sensing.
Sensors (Basel). 2020 May 22;20(10):2948. doi: 10.3390/s20102948.
2
Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model.
Sensors (Basel). 2019 Nov 25;19(23):5163. doi: 10.3390/s19235163.
3
Target Localization and Tracking by Fusing Doppler Differentials from Cellular Emanations with a Multi-Spectral Video Tracker.
Sensors (Basel). 2018 Oct 30;18(11):3687. doi: 10.3390/s18113687.