

A linear model of acoustic-to-facial mapping: model parameters, data set size, and generalization across speakers.

Authors

Craig Matthew S, van Lieshout Pascal, Wong Willy

Affiliation

Institute of Biomaterials and Biomedical Engineering and Oral Dynamics Laboratory, University of Toronto, 160-500 University Avenue, Toronto, Ontario M5G 1V7, Canada.

Publication

J Acoust Soc Am. 2008 Nov;124(5):3183-90. doi: 10.1121/1.2982369.

PMID: 19045802
Abstract

The relationship between acoustic and visual speech is important for understanding speech perception, but it also forms the basis behind a type of facial animator, which can predict facial motion during speech given an acoustic input. This relationship was examined by revisiting a linear transformation model of audio-visual speech production. A mathematical model is constructed whereby the visual aspect of speech is reproduced from the acoustic signal via a linear transformation. Unlike previous studies in this area, this paper will address specific aspects of the model as related to the effects of window size for acoustic framing and the critical size of the training set. On average, facial motion is predicted with a correlation of 0.70 to the recorded motion, when the model is trained and then tested on the same subject. This is comparable to previous studies using either similar or different model approaches. Using a model trained on other subjects and then applying it to a new subject resulted in a prediction correlation of 0.65. Furthermore, acoustic windows of 100 ms and a data set of approximately 40 sentences are required for maximum predictability. The results are interpreted in terms of the underlying assumptions of the model.

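The core idea in the abstract — reproduce facial motion from acoustic features via a single linear transformation, then score prediction quality by correlation with the recorded motion — can be sketched with ordinary least squares. This is an illustrative reconstruction on synthetic data, not the authors' pipeline: the feature dimensions, the 400/100 train/test split, and the noise level are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 13 acoustic coefficients per window, 6 facial marker coordinates.
n_frames, n_acoustic, n_facial = 500, 13, 6

# Synthetic "speaker": facial motion is a noisy linear function of the acoustics.
W_true = rng.normal(size=(n_acoustic, n_facial))
A = rng.normal(size=(n_frames, n_acoustic))                    # acoustic frames (rows)
F = A @ W_true + 0.3 * rng.normal(size=(n_frames, n_facial))   # recorded facial motion

# Train and test on the same subject, as in the within-speaker condition.
A_train, A_test = A[:400], A[400:]
F_train, F_test = F[:400], F[400:]

# Fit the linear transformation F ≈ A @ W by least squares.
W, *_ = np.linalg.lstsq(A_train, F_train, rcond=None)
F_pred = A_test @ W

# Score: per-coordinate Pearson correlation between predicted and recorded motion,
# averaged across facial dimensions (the paper reports this kind of mean correlation).
corrs = [np.corrcoef(F_pred[:, j], F_test[:, j])[0, 1] for j in range(n_facial)]
mean_corr = float(np.mean(corrs))
print(round(mean_corr, 2))
```

On this clean synthetic data the correlation is near 1; the paper's reported 0.70 (within-speaker) and 0.65 (cross-speaker) reflect real speech, where the acoustic-to-facial relationship is only approximately linear. The cross-speaker condition would correspond to fitting `W` on other subjects' `(A, F)` pairs and applying it to a new subject's acoustics.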

Similar articles

1. A linear model of acoustic-to-facial mapping: model parameters, data set size, and generalization across speakers.
   J Acoust Soc Am. 2008 Nov;124(5):3183-90. doi: 10.1121/1.2982369.
2. Prediction of acoustic feature parameters using myoelectric signals.
   IEEE Trans Biomed Eng. 2010 Jul;57(7):1587-95. doi: 10.1109/TBME.2010.2041455. Epub 2010 Feb 17.
3. Contributions of oral and extraoral facial movement to visual and audiovisual speech perception.
   J Exp Psychol Hum Percept Perform. 2004 Oct;30(5):873-88. doi: 10.1037/0096-1523.30.5.873.
4. Analysis and prediction of acoustic speech features from mel-frequency cepstral coefficients in distributed speech recognition architectures.
   J Acoust Soc Am. 2008 Dec;124(6):3989-4000. doi: 10.1121/1.2997436.
5. Visual speech improves the intelligibility of time-expanded auditory speech.
   Neuroreport. 2009 Mar 25;20(5):473-7. doi: 10.1097/WNR.0b013e3283279ae8.
6. Expressive facial animation synthesis by learning speech coarticulation and expression spaces.
   IEEE Trans Vis Comput Graph. 2006 Nov-Dec;12(6):1523-34. doi: 10.1109/TVCG.2006.90.
7. Effects of room acoustics on the intelligibility of speech in classrooms for young children.
   J Acoust Soc Am. 2009 Feb;125(2):922-33. doi: 10.1121/1.3058900.
8. Nonuniform speaker normalization using affine transformation.
   J Acoust Soc Am. 2008 Sep;124(3):1727-38. doi: 10.1121/1.2951597.
9. Monaural room acoustic parameters from music and speech.
   J Acoust Soc Am. 2008 Jul;124(1):278-87. doi: 10.1121/1.2931960.
10. Somatosensory basis of speech production.
    Nature. 2003 Jun 19;423(6942):866-9. doi: 10.1038/nature01710.

Cited by

1. Common cues to emotion in the dynamic facial expressions of speech and song.
   Q J Exp Psychol (Hove). 2015;68(5):952-70. doi: 10.1080/17470218.2014.971034. Epub 2014 Nov 25.
2. Psychophysics of the McGurk and other audiovisual speech integration effects.
   J Exp Psychol Hum Percept Perform. 2011 Aug;37(4):1193-209. doi: 10.1037/a0023100.