Ma Jiyong, Cole Ron, Pellom Bryan, Ward Wayne, Wise Barbara
Center for Spoken Language Research, University of Colorado at Boulder, CO 80309-0594, USA.
IEEE Trans Vis Comput Graph. 2006 Mar-Apr;12(2):266-76. doi: 10.1109/TVCG.2006.18.
We present a novel approach to synthesizing accurate visible speech based on searching and concatenating optimal variable-length units in a large corpus of motion capture data. Based on a set of visual prototypes selected on a source face and a corresponding set designated for a target face, we propose a machine learning technique to automatically map the facial motions observed on the source face to the target face. In order to model the long-distance coarticulation effects in visible speech, a large-scale corpus that covers the most common syllables in English was collected, annotated, and analyzed. For any input text, we describe a search algorithm that locates the optimal sequence of concatenated units for synthesis. A new algorithm to adapt lip motions from a generic 3D face model to a specific 3D face model is also proposed. A complete, end-to-end visible speech animation system is implemented based on this approach. The system is currently used in more than 60 kindergarten through third grade classrooms to teach students to read using a lifelike conversational animated agent. To evaluate the quality of the visible speech produced by the animation system, both subjective and objective evaluations are conducted. The evaluation results show that the proposed approach is accurate and powerful for visible speech synthesis.
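The abstract's core search step, finding the cheapest sequence of concatenated units for a target utterance, is a classic dynamic-programming problem. The sketch below illustrates the general unit-selection technique under simple assumptions; the function names, cost functions, and toy data are hypothetical and not the paper's actual implementation, which operates on variable-length motion-capture units with coarticulation-aware costs.

```python
# Hypothetical sketch of unit selection by dynamic programming:
# pick one candidate unit per target position so that the sum of
# per-unit target costs plus neighbor concatenation costs is minimal.
# All names and cost functions here are illustrative assumptions.

def select_units(targets, candidates, target_cost, concat_cost):
    """targets    : desired unit labels (e.g. syllables)
    candidates : one list of candidate corpus units per target position
    Returns the minimum-cost sequence of candidates."""
    # best[i][j] = (min path cost ending at candidates[i][j], back-pointer)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for cand in candidates[i]:
            tc = target_cost(targets[i], cand)
            cost, back = min(
                (best[i - 1][k][0] + concat_cost(prev, cand) + tc, k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost, back))
        best.append(row)
    # Trace back the cheapest path from the last position.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))


# Toy usage: two target syllables, two candidate recordings each.
def tcost(t, c):
    return 0 if c.startswith(t) else 10  # prefer matching labels

def ccost(a, b):
    return abs(int(a[-1]) - int(b[-1]))  # prefer same-take neighbors

path = select_units(["ba", "na"],
                    [["ba1", "ba2"], ["na1", "na2"]],
                    tcost, ccost)
# path is ["ba1", "na1"]: zero target cost and zero join cost
```

The same two-cost structure (how well a unit matches its target, plus how smoothly adjacent units join) underlies most concatenative synthesis searches; richer systems replace these toy costs with phonetic-context and motion-continuity distances.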