Ma Jiyong, Cole Ron, Pellom Bryan, Ward Wayne, Wise Barbara
Center for Spoken Language Research, University of Colorado at Boulder, CO 80309-0594, USA.
IEEE Trans Vis Comput Graph. 2006 Mar-Apr;12(2):266-76. doi: 10.1109/TVCG.2006.18.
We present a novel approach to synthesizing accurate visible speech based on searching and concatenating optimal variable-length units in a large corpus of motion capture data. Based on a set of visual prototypes selected on a source face and a corresponding set designated for a target face, we propose a machine learning technique to automatically map the facial motions observed on the source face to the target face. In order to model the long-distance coarticulation effects in visible speech, a large-scale corpus that covers the most common syllables in English was collected, annotated, and analyzed. For any input text, we describe a search algorithm that locates the optimal sequence of concatenated units for synthesis. A new algorithm to adapt lip motions from a generic 3D face model to a specific 3D face model is also proposed. A complete, end-to-end visible speech animation system is implemented based on this approach. The system is currently used in more than 60 kindergarten through third grade classrooms to teach students to read using a lifelike conversational animated agent. To evaluate the quality of the visible speech produced by the animation system, both subjective and objective evaluations are conducted. The evaluation results show that the proposed approach is accurate and powerful for visible speech synthesis.
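The abstract's core search step, finding the cheapest sequence of concatenated units for a target utterance, is a classic dynamic-programming problem. The sketch below illustrates the general unit-selection technique under simple assumptions; the function names, cost functions, and toy data are hypothetical and not the paper's actual implementation, which operates on variable-length motion-capture units with coarticulation-aware costs.

```python
# Hypothetical sketch of unit selection by dynamic programming:
# pick one candidate unit per target position so that the sum of
# per-unit target costs plus neighbor concatenation costs is minimal.
# All names and cost functions here are illustrative assumptions.

def select_units(targets, candidates, target_cost, concat_cost):
    """targets    : desired unit labels (e.g. syllables)
    candidates : one list of candidate corpus units per target position
    Returns the minimum-cost sequence of candidates."""
    # best[i][j] = (min path cost ending at candidates[i][j], back-pointer)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for cand in candidates[i]:
            tc = target_cost(targets[i], cand)
            cost, back = min(
                (best[i - 1][k][0] + concat_cost(prev, cand) + tc, k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost, back))
        best.append(row)
    # Trace back the cheapest path from the last position.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))


# Toy usage: two target syllables, two candidate recordings each.
def tcost(t, c):
    return 0 if c.startswith(t) else 10  # prefer matching labels

def ccost(a, b):
    return abs(int(a[-1]) - int(b[-1]))  # prefer same-take neighbors

path = select_units(["ba", "na"],
                    [["ba1", "ba2"], ["na1", "na2"]],
                    tcost, ccost)
# path is ["ba1", "na1"]: zero target cost and zero join cost
```

The same two-cost structure (how well a unit matches its target, plus how smoothly adjacent units join) underlies most concatenative synthesis searches; richer systems replace these toy costs with phonetic-context and motion-continuity distances.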