Sargin Mehmet E, Yemez Yucel, Erzin Engin, Tekalp Ahmet M
Department of Electrical and Computer Engineering, University of California-Santa Barbara, Santa Barbara, CA 93106-9560, USA.
IEEE Trans Pattern Anal Mach Intell. 2008 Aug;30(8):1330-45. doi: 10.1109/TPAMI.2007.70797.
We propose a new two-stage framework for joint analysis of head gesture and speech prosody patterns of a speaker toward automatic, realistic synthesis of head gestures from speech prosody. In the first analysis stage, we perform Hidden Markov Model (HMM) based unsupervised temporal segmentation of head gesture and speech prosody features separately to determine elementary head gesture and speech prosody patterns, respectively, for a particular speaker. In the second stage, joint analysis of correlations between these elementary head gesture and prosody patterns is performed using Multi-Stream HMMs to determine an audio-visual mapping model. The resulting audio-visual mapping model is then employed to synthesize natural head gestures from arbitrary input test speech, given a head model for the speaker. In the synthesis stage, the audio-visual mapping model is used to predict a sequence of gesture patterns from the prosody pattern sequence computed for the input test speech. The Euler angles associated with each gesture pattern are then applied to animate the speaker head model. Objective and subjective evaluations indicate that the proposed synthesis-by-analysis scheme provides natural-looking head gestures for the speaker with any input test speech, as well as in "prosody transplant" and "gesture transplant" scenarios.
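To make the first analysis stage concrete, below is a minimal sketch of HMM-based unsupervised temporal segmentation using the hmmlearn library. Treating runs of the same decoded state as elementary patterns matches the abstract's description at a high level, but the feature choice, number of states, covariance type, and training settings here are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch: unsupervised temporal segmentation of a feature
# stream with a Gaussian HMM (assumes the hmmlearn library).
import numpy as np
from hmmlearn import hmm

def segment_into_patterns(features, n_patterns=8, seed=0):
    """Fit a Gaussian HMM to a per-frame feature stream and return the
    decoded state sequence; runs of the same state serve as elementary
    pattern segments (hypothetical stand-in for the paper's stage one)."""
    model = hmm.GaussianHMM(n_components=n_patterns,
                            covariance_type="diag",
                            n_iter=100,
                            random_state=seed)
    model.fit(features)             # unsupervised Baum-Welch training
    return model.predict(features)  # Viterbi state label per frame

# Example: prosody features as (pitch, energy) per frame; the gesture
# stream would use per-frame Euler angles instead.
prosody = np.random.rand(500, 2)    # placeholder feature matrix
labels = segment_into_patterns(prosody)
```

The same routine would be run separately on the gesture and prosody streams, yielding the two pattern alphabets that the second-stage Multi-Stream HMM analysis then correlates.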
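The final animation step can likewise be sketched: each predicted gesture pattern carries a triple of Euler angles, which is converted to a rotation matrix and applied to the head-model vertices. The angle convention (Z-Y-X, i.e., yaw-pitch-roll) and the NumPy-based mesh rotation below are assumptions for illustration; the abstract does not specify these details.

```python
# Hedged sketch of the animation step: Euler angles -> rotation matrix,
# applied to a 3-D head-model vertex set. Angle order is an assumption.
import numpy as np

def euler_to_rotation(yaw, pitch, roll):
    """Compose R = Rz(yaw) @ Ry(pitch) @ Rx(roll), angles in radians."""
    cz, sz = np.cos(yaw), np.sin(yaw)
    cy, sy = np.cos(pitch), np.sin(pitch)
    cx, sx = np.cos(roll), np.sin(roll)
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    return Rz @ Ry @ Rx

# Rotate head-model vertices (N x 3) by the pose predicted for one frame.
vertices = np.random.rand(100, 3)   # placeholder mesh
posed = vertices @ euler_to_rotation(0.1, -0.05, 0.02).T
```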