


Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion.

Publication

IEEE Trans Pattern Anal Mach Intell. 2018 May;40(5):1086-1099. doi: 10.1109/TPAMI.2017.2648793. Epub 2017 Jan 5.

DOI: 10.1109/TPAMI.2017.2648793
PMID: 28103192
Abstract

Speaker diarization consists of assigning speech signals to people engaged in a dialogue. An audio-visual spatiotemporal diarization model is proposed. The model is well suited for challenging scenarios that consist of several participants engaged in multi-party interaction while they move around and turn their heads towards the other participants rather than facing the cameras and the microphones. Multiple-person visual tracking is combined with multiple speech-source localization in order to tackle the speech-to-person association problem. The latter is solved within a novel audio-visual fusion method on the following grounds: binaural spectral features are first extracted from a microphone pair, then a supervised audio-visual alignment technique maps these features onto an image, and finally a semi-supervised clustering method assigns binaural spectral features to visible persons. The main advantage of this method over previous work is that it processes in a principled way speech signals uttered simultaneously by multiple persons. The diarization itself is cast into a latent-variable temporal graphical model that infers speaker identities and speech turns, based on the output of an audio-visual association process, executed at each time slice, and on the dynamics of the diarization variable itself. The proposed formulation yields an efficient exact inference procedure. A novel dataset, that contains audio-visual training data as well as a number of scenarios involving several participants engaged in formal and informal dialogue, is introduced. The proposed method is thoroughly tested and benchmarked with respect to several state-of-the-art diarization algorithms.
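The "latent-variable temporal graphical model" with "efficient exact inference" described above is, at its core, an HMM-style dynamic program over speaker states: per-frame audio-visual association scores play the role of emission likelihoods, and the diarization dynamics play the role of transitions. As a rough single-active-speaker illustration only (the paper's actual model handles simultaneous speakers and is not reproduced here; the function name and score matrices are hypothetical), a Viterbi decoding pass might look like:

```python
import numpy as np

def viterbi_diarization(assoc_loglik, log_trans, log_prior):
    """Most likely speaker sequence given per-frame association scores.

    assoc_loglik: (T, N) log p(audio-visual observation at t | speaker n)
    log_trans:    (N, N) log p(speaker_t = j | speaker_{t-1} = i)
    log_prior:    (N,)   log p(speaker_0 = n)
    Returns an integer array of length T with the MAP speaker per frame.
    """
    T, N = assoc_loglik.shape
    delta = log_prior + assoc_loglik[0]          # best log-score ending in each state
    back = np.zeros((T, N), dtype=int)           # backpointers for path recovery
    for t in range(1, T):
        scores = delta[:, None] + log_trans      # (N, N): prev state x next state
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + assoc_loglik[t]
    path = np.empty(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):                # trace backpointers
        path[t - 1] = back[t, path[t]]
    return path
```

With "sticky" transitions (high self-transition probability), such a decoder smooths over frames where the audio-visual association is momentarily ambiguous, which is the intuition behind modeling diarization dynamics rather than classifying each time slice independently.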


Similar Articles

1. Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion. IEEE Trans Pattern Anal Mach Intell. 2018 May;40(5):1086-1099. doi: 10.1109/TPAMI.2017.2648793. Epub 2017 Jan 5.
2. Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model. Sensors (Basel). 2019 Nov 25;19(23):5163. doi: 10.3390/s19235163.
3. Development of Supervised Speaker Diarization System Based on the PyAnnote Audio Processing Library. Sensors (Basel). 2023 Feb 13;23(4):2082. doi: 10.3390/s23042082.
4. Multisensory Fusion for Unsupervised Spatiotemporal Speaker Diarization. Sensors (Basel). 2024 Jun 29;24(13):4229. doi: 10.3390/s24134229.
5. Speaker-turn aware diarization for speech-based cognitive assessments. Front Neurosci. 2024 Jan 16;17:1351848. doi: 10.3389/fnins.2023.1351848. eCollection 2023.
6. Supervised Speaker Diarization Using Random Forests: A Tool for Psychotherapy Process Research. Front Psychol. 2020 Jul 28;11:1726. doi: 10.3389/fpsyg.2020.01726. eCollection 2020.
7. Semi-supervised audio-driven TV-news speaker diarization using deep neural embeddings. J Acoust Soc Am. 2020 Dec;148(6):3751. doi: 10.1121/10.0002924.
8. Multimodal Speaker Diarization. IEEE Trans Pattern Anal Mach Intell. 2012 Jan;34(1):79-93. doi: 10.1109/TPAMI.2011.47. Epub 2011 Mar 10.
9. End-to-end neural speaker diarization with an iterative adaptive attractor estimation. Neural Netw. 2023 Sep;166:566-578. doi: 10.1016/j.neunet.2023.07.043. Epub 2023 Aug 1.
10. Evaluation of Deep Clustering for Diarization of Aphasic Speech. Stud Health Technol Inform. 2019;260:81-88.

Cited By

1. Multimodality Fusion Aspects of Medical Diagnosis: A Comprehensive Review. Bioengineering (Basel). 2024 Dec 5;11(12):1233. doi: 10.3390/bioengineering11121233.
2. Multisensory Fusion for Unsupervised Spatiotemporal Speaker Diarization. Sensors (Basel). 2024 Jun 29;24(13):4229. doi: 10.3390/s24134229.
3. Audiovisual Tracking of Multiple Speakers in Smart Spaces. Sensors (Basel). 2023 Aug 5;23(15):6969. doi: 10.3390/s23156969.
4. Speech Discrimination in Real-World Group Communication Using Audio-Motion Multimodal Sensing. Sensors (Basel). 2020 May 22;20(10):2948. doi: 10.3390/s20102948.
5. Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model. Sensors (Basel). 2019 Nov 25;19(23):5163. doi: 10.3390/s19235163.