Zahorian S A, Jagharghi A J
Department of Electrical and Computer Engineering, Old Dominion University, Norfolk, Virginia 23529.
J Acoust Soc Am. 1993 Oct;94(4):1966-82. doi: 10.1121/1.407520.
The first three formants, i.e., the first three spectral prominences of the short-time magnitude spectra, have been the most commonly used acoustic cues for vowels ever since the work of Peterson and Barney [J. Acoust. Soc. Am. 24, 175-184 (1952)]. However, spectral-shape features, which encode the global smoothed spectrum, provide a more complete spectral description, and therefore might be even better acoustic correlates for vowels. In this study, automatic vowel classification experiments were used to compare formants and spectral-shape features for monophthongal vowels spoken in the context of isolated CVC words, under a variety of conditions. The roles of static and time-varying information for vowel discrimination were also compared. Spectral shape was encoded using the coefficients in a cosine expansion of the nonlinearly scaled magnitude spectrum. Under almost all conditions investigated, in the absence of fundamental frequency (F0) information, automatic vowel classification based on spectral-shape features was superior to that based on formants. If F0 was used as an additional feature, vowel classification based on spectral-shape features was still superior to that based on formants, but the differences between the two feature sets were reduced. It was also found that the error pattern of perceptual confusions was more closely correlated with errors in automatic classification obtained from spectral-shape features than with classification errors from formants. Therefore, it is concluded that spectral-shape features are a more complete set of acoustic correlates for vowel identity than are formants. In comparing static and time-varying features, static features were the most important for vowel discrimination, but feature trajectories were valuable secondary sources of information.
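The encoding of spectral shape as coefficients in a cosine expansion of a nonlinearly scaled magnitude spectrum can be illustrated with a minimal sketch. The abstract does not state the scaling functions, the window, or the number of coefficients, so everything specific here is an assumption made for illustration: log amplitude compression, a mel-like frequency warping, a Hamming window, ten coefficients, and the hypothetical function name `spectral_shape_features`. It should not be read as the authors' implementation.

```python
import numpy as np

def spectral_shape_features(frame, sample_rate, n_coeffs=10, n_points=128):
    """Cosine-expansion coefficients of a log-magnitude spectrum
    sampled on a warped frequency axis (illustrative assumptions only)."""
    # Short-time magnitude spectrum of one windowed frame.
    windowed = frame * np.hamming(len(frame))
    magnitude = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)

    # Assumed amplitude scaling: log compression (small offset avoids log(0)).
    log_mag = np.log(magnitude + 1e-10)

    # Assumed frequency scaling: mel-like warping, normalized to [0, 1].
    warped = np.log1p(freqs / 700.0)
    warped /= warped[-1]

    # Resample the spectrum at uniformly spaced midpoints of the warped axis.
    grid = (np.arange(n_points) + 0.5) / n_points
    resampled = np.interp(grid, warped, log_mag)

    # Project onto cosine basis functions cos(pi * k * u), k = 0..n_coeffs-1.
    k = np.arange(n_coeffs)[:, None]
    basis = np.cos(np.pi * k * grid[None, :])
    return basis @ resampled / n_points

# Usage: features for one 32 ms frame of synthetic noise at 16 kHz.
rng = np.random.default_rng(0)
features = spectral_shape_features(rng.standard_normal(512), 16000)
```

Under these assumptions, the low-order coefficients describe the broad tilt and curvature of the smoothed spectrum rather than individual peak locations, which is the sense in which such features capture global spectral shape. An overall gain change in the input mainly shifts the zeroth coefficient and leaves the others essentially unchanged.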