

A linear model of acoustic-to-facial mapping: model parameters, data set size, and generalization across speakers.

Authors

Craig Matthew S, van Lieshout Pascal, Wong Willy

Affiliation

Institute of Biomaterials and Biomedical Engineering and Oral Dynamics Laboratory, University of Toronto, 160-500 University Avenue, Toronto, Ontario M5G 1V7, Canada.

Publication

J Acoust Soc Am. 2008 Nov;124(5):3183-90. doi: 10.1121/1.2982369.

PMID: 19045802
Abstract

The relationship between acoustic and visual speech is important for understanding speech perception, but it also forms the basis behind a type of facial animator, which can predict facial motion during speech given an acoustic input. This relationship was examined by revisiting a linear transformation model of audio-visual speech production. A mathematical model is constructed whereby the visual aspect of speech is reproduced from the acoustic signal via a linear transformation. Unlike previous studies in this area, this paper will address specific aspects of the model as related to the effects of window size for acoustic framing and the critical size of the training set. On average, facial motion is predicted with a correlation of 0.70 to the recorded motion, when the model is trained and then tested on the same subject. This is comparable to previous studies using either similar or different model approaches. Using a model trained on other subjects and then applying it to a new subject resulted in a prediction correlation of 0.65. Furthermore, acoustic windows of 100 ms and a data set of approximately 40 sentences are required for maximum predictability. The results are interpreted in terms of the underlying assumptions of the model.

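The core idea in the abstract — reproduce facial motion from acoustic features via a single linear transformation, then score prediction quality by correlation with the recorded motion — can be sketched with ordinary least squares. This is an illustrative reconstruction on synthetic data, not the authors' pipeline: the feature dimensions, the 400/100 train/test split, and the noise level are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 13 acoustic coefficients per window, 6 facial marker coordinates.
n_frames, n_acoustic, n_facial = 500, 13, 6

# Synthetic "speaker": facial motion is a noisy linear function of the acoustics.
W_true = rng.normal(size=(n_acoustic, n_facial))
A = rng.normal(size=(n_frames, n_acoustic))                    # acoustic frames (rows)
F = A @ W_true + 0.3 * rng.normal(size=(n_frames, n_facial))   # recorded facial motion

# Train and test on the same subject, as in the within-speaker condition.
A_train, A_test = A[:400], A[400:]
F_train, F_test = F[:400], F[400:]

# Fit the linear transformation F ≈ A @ W by least squares.
W, *_ = np.linalg.lstsq(A_train, F_train, rcond=None)
F_pred = A_test @ W

# Score: per-coordinate Pearson correlation between predicted and recorded motion,
# averaged across facial dimensions (the paper reports this kind of mean correlation).
corrs = [np.corrcoef(F_pred[:, j], F_test[:, j])[0, 1] for j in range(n_facial)]
mean_corr = float(np.mean(corrs))
print(round(mean_corr, 2))
```

On this clean synthetic data the correlation is near 1; the paper's reported 0.70 (within-speaker) and 0.65 (cross-speaker) reflect real speech, where the acoustic-to-facial relationship is only approximately linear. The cross-speaker condition would correspond to fitting `W` on other subjects' `(A, F)` pairs and applying it to a new subject's acoustics.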

Similar articles

1. A linear model of acoustic-to-facial mapping: model parameters, data set size, and generalization across speakers.
   J Acoust Soc Am. 2008 Nov;124(5):3183-90. doi: 10.1121/1.2982369.
2. Prediction of acoustic feature parameters using myoelectric signals.
   IEEE Trans Biomed Eng. 2010 Jul;57(7):1587-95. doi: 10.1109/TBME.2010.2041455. Epub 2010 Feb 17.
3. Contributions of oral and extraoral facial movement to visual and audiovisual speech perception.
   J Exp Psychol Hum Percept Perform. 2004 Oct;30(5):873-88. doi: 10.1037/0096-1523.30.5.873.
4. Analysis and prediction of acoustic speech features from mel-frequency cepstral coefficients in distributed speech recognition architectures.
   J Acoust Soc Am. 2008 Dec;124(6):3989-4000. doi: 10.1121/1.2997436.
5. Visual speech improves the intelligibility of time-expanded auditory speech.
   Neuroreport. 2009 Mar 25;20(5):473-7. doi: 10.1097/WNR.0b013e3283279ae8.
6. Expressive facial animation synthesis by learning speech coarticulation and expression spaces.
   IEEE Trans Vis Comput Graph. 2006 Nov-Dec;12(6):1523-34. doi: 10.1109/TVCG.2006.90.
7. Effects of room acoustics on the intelligibility of speech in classrooms for young children.
   J Acoust Soc Am. 2009 Feb;125(2):922-33. doi: 10.1121/1.3058900.
8. Nonuniform speaker normalization using affine transformation.
   J Acoust Soc Am. 2008 Sep;124(3):1727-38. doi: 10.1121/1.2951597.
9. Monaural room acoustic parameters from music and speech.
   J Acoust Soc Am. 2008 Jul;124(1):278-87. doi: 10.1121/1.2931960.
10. Somatosensory basis of speech production.
    Nature. 2003 Jun 19;423(6942):866-9. doi: 10.1038/nature01710.

Cited by

1. Common cues to emotion in the dynamic facial expressions of speech and song.
   Q J Exp Psychol (Hove). 2015;68(5):952-70. doi: 10.1080/17470218.2014.971034. Epub 2014 Nov 25.
2. Psychophysics of the McGurk and other audiovisual speech integration effects.
   J Exp Psychol Hum Percept Perform. 2011 Aug;37(4):1193-209. doi: 10.1037/a0023100.