de Cheveigné A, Kawahara H
Laboratoire de Linguistique Formelle, CNRS/Université Paris 7, France.
J Acoust Soc Am. 1999 Jun;105(6):3497-508. doi: 10.1121/1.424675.
Vowel identity correlates well with the shape of the transfer function of the vocal tract, in particular the position of the first two or three formant peaks. However, in voiced speech the transfer function is sampled at multiples of the fundamental frequency (F0), and the short-term spectrum contains peaks at those frequencies rather than at formants. It is not clear how the auditory system estimates the original spectral envelope from the vowel waveform. Cochlear excitation patterns, for example, resolve harmonics in the low-frequency region, and their shape varies strongly with F0. The problem cannot be cured by smoothing: lag-domain components of the spectral envelope are aliased and cause F0-dependent distortion. It is most acute at high F0s, where the spectral envelope is severely undersampled. This paper treats vowel identification as a process of pattern recognition with missing data. Matching is restricted to available data, and missing data are ignored using an F0-dependent weighting function that emphasizes regions near harmonics. The model is presented in two versions: a frequency-domain version based on short-term spectra or tonotopic excitation patterns, and a time-domain version based on autocorrelation functions. It accounts for the relative F0 independence observed in vowel identification.
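The frequency-domain idea can be illustrated with a minimal sketch. It is not the authors' implementation, and everything specific in it is an assumption made for illustration: the Gaussian toy envelopes, the formant values, the rectangular weighting function (`harmonic_weights`, 1 within `half_width` Hz of a harmonic and 0 elsewhere), and the squared-difference distance. The abstract specifies none of these; it says only that the weighting is F0-dependent and emphasizes regions near harmonics.

```python
import numpy as np

def harmonic_weights(freqs, f0, half_width=10.0):
    """Hypothetical weighting: 1 within half_width Hz of a harmonic of f0, else 0."""
    k = np.maximum(np.round(freqs / f0), 1.0)   # nearest harmonic number, skipping 0 Hz
    return (np.abs(freqs - k * f0) <= half_width).astype(float)

def envelope(freqs, formants, bandwidth=150.0):
    """Toy spectral envelope: a sum of Gaussian formant peaks (an assumption)."""
    env = np.zeros_like(freqs)
    for f in formants:
        env += np.exp(-0.5 * ((freqs - f) / bandwidth) ** 2)
    return env

def weighted_distance(observed, template, weights):
    """Squared difference averaged over weighted bins; zero-weight bins are ignored."""
    return float(np.sum(weights * (observed - template) ** 2) / np.sum(weights))

def identify(observed, templates, weights):
    """Return the label of the template closest to the observed spectrum."""
    return min(templates, key=lambda v: weighted_distance(observed, templates[v], weights))

# Illustrative frequency grid and formant frequencies in Hz (not taken from the paper).
FREQS = np.arange(0.0, 4001.0, 10.0)
FORMANTS = {
    "a": (700.0, 1200.0, 2500.0),
    "i": (300.0, 2300.0, 3000.0),
    "u": (300.0, 800.0, 2300.0),
}
TEMPLATES = {v: envelope(FREQS, f) for v, f in FORMANTS.items()}

def voiced_spectrum(vowel, f0):
    """Crude line spectrum: the envelope is kept near harmonics of f0 and zeroed elsewhere."""
    return TEMPLATES[vowel] * harmonic_weights(FREQS, f0)

if __name__ == "__main__":
    for f0 in (100.0, 200.0, 300.0):
        w = harmonic_weights(FREQS, f0)
        labels = [identify(voiced_spectrum(v, f0), TEMPLATES, w) for v in FORMANTS]
        print(f0, labels)
```

In this idealized setting the matching template has a weighted distance of exactly zero at every F0, because the bins between harmonics, where the envelope is unobserved, carry no weight. A smoother weighting function could be substituted for the rectangular one without changing the structure of the sketch.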