Ru Powen, Chi Taishih, Shamma Shihab
Center for Auditory and Acoustics Research, Institute for Systems Research, Electrical and Computer Engineering Department, University of Maryland, College Park, Maryland 20742, USA.
J Acoust Soc Am. 2003 Jan;113(1):498-515. doi: 10.1121/1.1525288.
Speech intelligibility is known to be relatively unaffected by certain deformations of the acoustic spectrum. These include translations, stretching or contracting dilations, and shearing of the spectrum (represented along the logarithmic frequency axis). It is argued here that such robustness reflects a synergy between vocal production and auditory perception. Thus, on the one hand, it is shown that these spectral distortions are produced by common and unavoidable variations among different speakers pertaining to the length, cross-sectional profile, and losses of their vocal tracts. On the other hand, it is argued that these spectral changes leave the auditory cortical representation of the spectrum largely unchanged except for translations along one of its representational axes. These assertions are supported by analyses of production and perception models. On the production side, a simplified sinusoidal model of the vocal tract is developed which analytically relates a few "articulatory" parameters, such as the extent and location of the vocal tract constriction, to the spectral peaks of the acoustic spectra synthesized from it. The model is evaluated by comparing the identification of synthesized sustained vowels to labeled natural vowels extracted from the TIMIT corpus. On the perception side, a "multiscale" model of sound processing is utilized to elucidate the effects of the deformations on the representation of the acoustic spectrum in the primary auditory cortex. Finally, the implications of these results for the perception of generally identifiable classes of sound sources beyond the specific case of speech and the vocal tract are discussed.
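The link between vocal tract length and a translation along the logarithmic frequency axis can be illustrated with a minimal sketch. It does not use the sinusoidal vocal tract model of the paper; it assumes the textbook idealization of a uniform, lossless tube closed at one end, whose resonances are F_n = (2n - 1)c / (4L). The function names, the speed of sound (35000 cm/s), and the two tract lengths (17.5 cm and 14 cm) are illustrative assumptions, not values taken from the paper.

```python
import math

# Assumed approximate speed of sound in warm, moist air (cm/s).
SPEED_OF_SOUND_CM_S = 35000.0


def tube_resonances(length_cm, count=3):
    """Resonances (Hz) of an idealized uniform lossless tube closed at one end."""
    return [
        (2 * n - 1) * SPEED_OF_SOUND_CM_S / (4.0 * length_cm)
        for n in range(1, count + 1)
    ]


def log2_shift(reference, other):
    """Displacement of each resonance, in octaves, between two resonance lists."""
    return [math.log2(b / a) for a, b in zip(reference, other)]


# Hypothetical tract lengths chosen only for illustration.
long_tract = tube_resonances(17.5)
short_tract = tube_resonances(14.0)

# Every resonance moves by the same number of octaves, i.e. the peak
# pattern is translated rigidly along the log-frequency axis.
shifts = log2_shift(long_tract, short_tract)
print(long_tract, short_tract, shifts)
```

Under this idealization, shortening the tube multiplies every resonance by the same factor (here 17.5 / 14 = 1.25), so the spacing of the peaks on a logarithmic axis is preserved and only their position changes. Changes in cross-sectional profile or losses, which the abstract associates with dilation and shearing, are not captured by this sketch.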