Lammert Adam C, Narayanan Shrikanth S
Computer Science Department, Swarthmore College, Swarthmore, PA, United States of America.
Signal Analysis and Interpretation Laboratory, University of Southern California, Los Angeles, CA, United States of America.
PLoS One. 2015 Jul 15;10(7):e0132193. doi: 10.1371/journal.pone.0132193. eCollection 2015.
Vocal tract length is highly variable across speakers and determines many aspects of the acoustic speech signal, making it an essential parameter to consider when explaining behavioral variability. A method for accurately estimating vocal tract length from formant frequencies would allow interspeaker variability to be normalized and would facilitate acoustic comparisons across speakers. A framework for reasoning about estimation methods is developed from the basic principles of vocal tract acoustics, and an estimation method that follows naturally from this framework is proposed. The proposed method is evaluated using acoustic characteristics of simulated vocal tracts ranging from 14 to 19 cm in length, as well as real-time magnetic resonance imaging data with synchronous audio from five speakers whose vocal tracts range from 14.5 to 18.0 cm in length. Evaluations show improvements in accuracy over previously proposed methods, with root mean square errors of 0.631 cm on simulated data and 1.277 cm on human speech data. Empirical results show that the effectiveness of the proposed method is based on emphasizing higher formant frequencies, which seem less affected by speech articulation. Theoretical predictions of formant sensitivity reinforce this empirical finding. Moreover, the theoretical analysis offers an explanation for why formant sensitivity differs across formants.
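The abstract does not state the estimator's formula. As background only, the sketch below uses the textbook uniform-tube (quarter-wavelength, closed at the glottis and open at the lips) model, in which the n-th resonance is F_n = (2n - 1)c / (4L), so each formant yields a length estimate L = (2n - 1)c / (4F_n). The speed of sound of 35,000 cm/s is an assumed textbook value, and the helper names (`tube_length_from_formant`, `estimate_vtl`, `rmse`) and the per-formant weighting are hypothetical illustrations of the general idea of emphasizing higher formants. This is not the authors' proposed method or their fitted weights.

```python
import math

# Assumed speed of sound in warm, humid air (cm/s); a common textbook value.
SPEED_OF_SOUND_CM_S = 35000.0


def tube_length_from_formant(freq_hz, n, c=SPEED_OF_SOUND_CM_S):
    """Length (cm) of a uniform closed-open tube whose n-th resonance is freq_hz.

    Quarter-wavelength model: F_n = (2n - 1) * c / (4 * L), so L = (2n - 1) * c / (4 * F_n).
    """
    if freq_hz <= 0 or n < 1:
        raise ValueError("need a positive frequency and n >= 1")
    return (2 * n - 1) * c / (4.0 * freq_hz)


def estimate_vtl(formants_hz, weights=None, c=SPEED_OF_SOUND_CM_S):
    """Hypothetical helper: weighted mean of per-formant uniform-tube length estimates.

    formants_hz[k] is F_(k+1). Weights default to uniform; giving larger weights to
    higher formants only illustrates the idea of emphasizing them. These are NOT
    the weights used in the paper.
    """
    if weights is None:
        weights = [1.0] * len(formants_hz)
    if len(weights) != len(formants_hz) or sum(weights) <= 0:
        raise ValueError("weights must match formants and have a positive sum")
    estimates = [tube_length_from_formant(f, k + 1, c) for k, f in enumerate(formants_hz)]
    return sum(w * e for w, e in zip(weights, estimates)) / sum(weights)


def rmse(estimates, targets):
    """Root mean square error between estimated and reference lengths (cm)."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimates, targets)) / len(estimates))


if __name__ == "__main__":
    # A neutral 17.5 cm tube resonates at 500, 1500, 2500, 3500 Hz under these assumptions.
    print(estimate_vtl([500.0, 1500.0, 2500.0, 3500.0]))  # 17.5
    # Shift F1 (as articulation might) and compare uniform vs. higher-formant weighting.
    shifted = [550.0, 1500.0, 2500.0, 3500.0]
    print(estimate_vtl(shifted), estimate_vtl(shifted, weights=[1, 2, 3, 4]))
```

In this toy case, shifting only F1 pulls the uniform average away from 17.5 cm more than the weighted average, because the weighting leans on the unperturbed higher formants. Real articulation perturbs all formants, so this demonstrates only the mechanics of weighting, not the paper's empirical result.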