Multimodal Speaker Diarization.

Publication Information

IEEE Trans Pattern Anal Mach Intell. 2012 Jan;34(1):79-93. doi: 10.1109/TPAMI.2011.47. Epub 2011 Mar 10.

Abstract

We present a novel probabilistic framework that fuses information coming from the audio and video modality to perform speaker diarization. The proposed framework is a Dynamic Bayesian Network (DBN) that is an extension of a factorial Hidden Markov Model (fHMM) and models the people appearing in an audiovisual recording as multimodal entities that generate observations in the audio stream, the video stream, and the joint audiovisual space. The framework is very robust to different contexts, makes no assumptions about the location of the recording equipment, and does not require labeled training data as it acquires the model parameters using the Expectation Maximization (EM) algorithm. We apply the proposed model to two meeting videos and a news broadcast video, all of which come from publicly available data sets. The results acquired in speaker diarization are in favor of the proposed multimodal framework, which outperforms the single modality analysis results and improves over the state-of-the-art audio-based speaker diarization.
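The model described above extends a factorial HMM: each person is a separate Markov chain, and the observation at each time step depends on the joint state of all chains. As a minimal illustration of that idea (not the paper's actual model), the sketch below runs the forward algorithm for two independent binary chains, e.g. two speakers each "silent" or "speaking", with a discrete observation conditioned on the joint state. All parameter values are hypothetical.

```python
import numpy as np

# Illustrative parameters, not from the paper.
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])          # per-chain transition matrix
pi = np.array([0.5, 0.5])           # per-chain initial distribution

# Hypothetical emission table P(obs | s1, s2) over 3 discrete symbols,
# e.g. "silence", "one voice", "overlapped speech".
B = np.zeros((2, 2, 3))
B[0, 0] = [0.80, 0.15, 0.05]
B[0, 1] = [0.10, 0.80, 0.10]
B[1, 0] = [0.10, 0.80, 0.10]
B[1, 1] = [0.05, 0.15, 0.80]

def forward_loglik(obs):
    """Log-likelihood of an observation sequence under the two-chain
    factorial HMM, via the forward algorithm on the joint state space."""
    # alpha[i1, i2] tracks the (normalized) joint forward probabilities.
    alpha = np.outer(pi, pi) * B[:, :, obs[0]]
    ll = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        # The joint transition factorizes across chains:
        # pred[j1, j2] = sum_{i1, i2} alpha[i1, i2] * A[i1, j1] * A[i2, j2]
        pred = A.T @ alpha @ A
        alpha = pred * B[:, :, o]
        ll += np.log(alpha.sum())
        alpha /= alpha.sum()
    return ll
```

Because the chains evolve independently, the joint transition is a product of per-chain transitions, which is what keeps the state space tractable; in the paper's full model, EM would be used to fit the transition and emission parameters rather than fixing them by hand.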
