IEEE Trans Pattern Anal Mach Intell. 2012 Jan;34(1):79-93. doi: 10.1109/TPAMI.2011.47. Epub 2011 Mar 10.
We present a novel probabilistic framework that fuses information from the audio and video modalities to perform speaker diarization. The proposed framework is a Dynamic Bayesian Network (DBN) that extends the factorial Hidden Markov Model (fHMM) and models the people appearing in an audiovisual recording as multimodal entities that generate observations in the audio stream, the video stream, and the joint audiovisual space. The framework is robust across different recording contexts, makes no assumptions about the location of the recording equipment, and requires no labeled training data, as it acquires the model parameters with the Expectation-Maximization (EM) algorithm. We apply the proposed model to two meeting videos and a news broadcast video, all drawn from publicly available data sets. The speaker diarization results favor the proposed multimodal framework, which outperforms single-modality analysis and improves over state-of-the-art audio-based speaker diarization.
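To make the factorial-HMM idea concrete, the sketch below shows a toy two-speaker model in Python: each person is an independent binary speaking/silent Markov chain, the joint state couples the chains at emission time, and a forward recursion (the core computation inside EM's E-step) yields posteriors over who is speaking at each frame. This is a minimal illustration under stated assumptions; the per-person transition matrix, the Gaussian emission model, and the fused one-dimensional audiovisual feature are all placeholders, not the parameterization or observation models used in the paper.

```python
import numpy as np

# Toy two-speaker factorial HMM for audiovisual diarization.
# All names, dimensions, and distributions are illustrative assumptions.

n_speakers = 2              # one binary speaking/silent chain per person
n_states = 2 ** n_speakers  # joint state: which subset of people is speaking

# Each person's speaking/silent chain has its own transition matrix
# (here the same matrix is shared by both chains for simplicity).
A_person = np.array([[0.9, 0.1],
                     [0.2, 0.8]])

def joint_transitions(A, n):
    """Joint transition matrix when the n chains evolve independently
    (the factorial assumption): bit k of a joint state is person k's state."""
    T = np.ones((2 ** n, 2 ** n))
    for s in range(2 ** n):
        for t in range(2 ** n):
            for k in range(n):
                T[s, t] *= A[(s >> k) & 1, (t >> k) & 1]
    return T

A_joint = joint_transitions(A_person, n_speakers)

# Toy emission model: a Gaussian whose mean depends on which speakers are
# active -- a stand-in for the paper's audio, video, and joint AV observations.
means = np.array([0.0, 1.0, 2.0, 3.0])          # one mean per joint state
obs = np.array([0.1, 0.9, 2.1, 3.2, 0.2, 1.1])  # fake fused AV feature track

def gauss(x, mu, sigma=0.5):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Forward recursion over the joint state space, normalized at each step
# to avoid numerical underflow; this is the E-step's core computation.
alpha = np.full(n_states, 1.0 / n_states) * gauss(obs[0], means)
alpha /= alpha.sum()
for x in obs[1:]:
    alpha = (alpha @ A_joint) * gauss(x, means)
    alpha /= alpha.sum()

print("posterior over joint speaker states at final frame:", alpha.round(3))
```

In a full EM loop, the forward pass above would be paired with a backward pass to obtain state and transition posteriors (E-step), from which the transition and emission parameters are re-estimated (M-step); the paper's model additionally ties each person's chain to separate audio, video, and joint audiovisual observation streams.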