Haskins Laboratories, 300 George Street, Suite 900, New Haven, Connecticut 06511, USA.
J Acoust Soc Am. 2012 Dec;132(6):3980-9. doi: 10.1121/1.4763545.
Speech can be represented as a constellation of constricting vocal tract actions called gestures, whose temporal patterning with respect to one another is expressed in a gestural score. Current speech datasets do not come with gestural annotation, and no formal gestural annotation procedure exists at present. This paper describes an iterative, analysis-by-synthesis, landmark-based time-warping architecture to perform gestural annotation of natural speech. For a given utterance, the Haskins Laboratories Task Dynamics and Application (TADA) model is employed to generate a corresponding prototype gestural score. The gestural score is temporally optimized through an iterative time-warping process such that the acoustic distance between the original and TADA-synthesized speech is minimized. This paper demonstrates that the proposed iterative approach is superior to conventional acoustically referenced dynamic time-warping procedures and provides reliable gestural annotation for speech datasets.
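For readers unfamiliar with the conventional dynamic time-warping baseline that the abstract compares against, the following is a minimal sketch of textbook dynamic time warping, not the iterative landmark-based procedure proposed in the paper. It assumes, purely for illustration, that each frame is a single scalar feature and that frame distance is the absolute difference; a real system would compare acoustic feature vectors. The function name `dtw` and its arguments are hypothetical.

```python
def dtw(reference, candidate):
    """Align two 1-D feature sequences with classic dynamic time warping.

    Returns (total_cost, path), where path is a list of (i, j) index pairs
    matching reference[i] to candidate[j]. Scalar frames and absolute
    difference are illustrative assumptions, not the paper's features.
    """
    n, m = len(reference), len(candidate)
    inf = float("inf")
    # cost[i][j]: minimal accumulated distance aligning reference[:i] with candidate[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(reference[i - 1] - candidate[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # advance reference only
                                 cost[i][j - 1],      # advance candidate only
                                 cost[i - 1][j - 1])  # advance both
    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


if __name__ == "__main__":
    total, path = dtw([1, 2, 3], [1, 2, 2, 3])
    print(total, path)
```

In an analysis-by-synthesis setting, `reference` would hold features of the natural utterance and `candidate` features of the synthesized one; the recovered path indicates how the synthetic timing would need to be stretched or compressed to match the natural speech.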