Inferring articulation and recognizing gestures from acoustics with a neural network trained on x-ray microbeam data.

Authors

Papcun G, Hochberg J, Thomas T R, Laroche F, Zacks J, Levy S

Affiliation

Computing and Communications Division, Los Alamos National Laboratory, New Mexico 87545.

Publication information

J Acoust Soc Am. 1992 Aug;92(2 Pt 1):688-700. doi: 10.1121/1.403994.

DOI:10.1121/1.403994

PMID:1506525

Abstract

This paper describes a method for inferring articulatory parameters from acoustics with a neural network trained on paired acoustic and articulatory data. An x-ray microbeam recorded the vertical movements of the lower lip, tongue tip, and tongue dorsum of three speakers saying the English stop consonants in repeated Ce syllables. A neural network was then trained to map from simultaneously recorded acoustic data to the articulatory data. To evaluate learning, acoustics from the training set were passed through the neural network. To evaluate generalization, acoustics from speakers or consonants excluded from the training set were passed through the network. The articulatory trajectories thus inferred were a good fit to the actual movements in both the learning and generalization conditions, as judged by root-mean-square error and correlation. Inferred trajectories were also matched to templates of lower lip, tongue tip, and tongue dorsum release gestures extracted from the original data. This technique correctly recognized from 94.4% to 98.9% of all gestures in the learning and cross-speaker generalization conditions, and 75% of gestures underlying consonants excluded from the training set. In addition, greater regularity was observed for movements of articulators that were critical in the formation of each consonant.
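The two fit measures named in the abstract, root-mean-square error and correlation, together with a simple form of template matching, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state how trajectories were compared with templates, so choosing the template with the highest Pearson correlation is an assumption made here for illustration. The trajectory values and the labels `rise_fall` and `fall_rise` are likewise hypothetical.

```python
import math


def rms_error(inferred, actual):
    """Root-mean-square difference between two equal-length trajectories."""
    if len(inferred) != len(actual):
        raise ValueError("trajectories must have equal length")
    squared = [(a - b) ** 2 for a, b in zip(inferred, actual)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    if len(x) != len(y):
        raise ValueError("trajectories must have equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    denom = math.sqrt(sum(d * d for d in dx) * sum(d * d for d in dy))
    if denom == 0:
        return 0.0
    return sum(a * b for a, b in zip(dx, dy)) / denom


def best_template(trajectory, templates):
    """Label of the template most correlated with the trajectory.

    Assumption: correlation is used as the matching score; the abstract
    does not specify the criterion actually used.
    """
    return max(templates, key=lambda label: pearson_r(trajectory, templates[label]))


# Hypothetical vertical-position samples for one articulator (arbitrary units).
actual = [0.0, 1.0, 2.0, 1.0, 0.0]
inferred = [0.1, 0.9, 2.1, 0.9, 0.1]

# Hypothetical gesture templates, not taken from the paper.
templates = {
    "rise_fall": [0.0, 1.0, 2.0, 1.0, 0.0],
    "fall_rise": [2.0, 1.0, 0.0, 1.0, 2.0],
}

print(rms_error(inferred, actual))
print(pearson_r(inferred, actual))
print(best_template(inferred, templates))
```

A low RMS error together with a correlation near 1 indicates that the inferred trajectory tracks both the magnitude and the shape of the measured movement, while template matching depends on shape alone.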
