Assmann P F, Summerfield Q
MRC Institute of Hearing Research, University Park, Nottingham, England.
J Acoust Soc Am. 1989 Jan;85(1):327-38. doi: 10.1121/1.397684.
The ability of listeners to identify pairs of simultaneous synthetic vowels has been investigated in the first of a series of studies on the extraction of phonetic information from multiple-talker waveforms. Both members of the vowel pair had the same onset and offset times and a constant fundamental frequency of 100 Hz. Listeners identified both vowels with an accuracy significantly greater than chance. The pattern of correct responses and confusions was similar for vowels generated by (a) cascade formant synthesis and (b) additive harmonic synthesis that replaced each of the lowest three formants with a single pair of harmonics of equal amplitude. In order to choose an appropriate model for describing listeners' performance, four pattern-matching procedures were evaluated. Each predicted the probability that (i) any individual vowel would be selected as one of the two responses, and (ii) any pair of vowels would be selected. These probabilities were estimated from measures of the similarities of the auditory excitation patterns of the double vowels to those of single-vowel reference patterns. Up to 88% of the variance in individual responses and up to 67% of the variance in pairwise responses could be accounted for by procedures that highlighted spectral peaks and shoulders in the excitation pattern. Procedures that assigned uniform weight to all regions of the excitation pattern gave poorer predictions. These findings support the hypothesis that the auditory system pays particular attention to the frequencies of spectral peaks, and possibly also of shoulders, when identifying vowels. One virtue of this strategy is that the spectral peaks and shoulders can indicate the frequencies of formants when other aspects of spectral shape are obscured by competing sounds.
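The contrast between peak-weighted and uniformly weighted pattern matching can be illustrated with a minimal sketch. It is not a reconstruction of any of the four procedures evaluated in the study. Everything in it is an assumption made for illustration: the toy Gaussian "excitation patterns", the hypothetical vowel labels A to D, the use of a rectified negative second difference to highlight peaks and shoulders, cosine similarity as the similarity measure, and the conversion of similarities to response probabilities by simple normalization.

```python
import math

N_CHANNELS = 40  # assumed number of frequency channels in the toy pattern


def toy_pattern(peak_centers, width=2.0, floor=0.2):
    """Toy stand-in for an excitation pattern: flat floor plus Gaussian peaks."""
    pattern = [floor] * N_CHANNELS
    for center in peak_centers:
        for i in range(N_CHANNELS):
            pattern[i] += math.exp(-0.5 * ((i - center) / width) ** 2)
    return pattern


def uniform_weight(pattern):
    """Leave the pattern unchanged, so every region counts equally."""
    return list(pattern)


def peak_emphasis(pattern):
    """Rectified negative second difference.

    Positive where the pattern is locally concave (peaks and shoulders),
    zero elsewhere. This is an assumed transform, chosen for simplicity.
    """
    out = [0.0] * len(pattern)
    for i in range(1, len(pattern) - 1):
        curvature = pattern[i - 1] - 2.0 * pattern[i] + pattern[i + 1]
        out[i] = max(0.0, -curvature)
    return out


def cosine_similarity(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def response_probabilities(double_pattern, templates, transform):
    """Normalize template similarities so they sum to 1 (assumed choice rule)."""
    target = transform(double_pattern)
    scores = {
        name: cosine_similarity(target, transform(template))
        for name, template in templates.items()
    }
    total = sum(scores.values())
    if total == 0.0:
        return {name: 1.0 / len(scores) for name in scores}
    return {name: score / total for name, score in scores.items()}


# Hypothetical single-vowel reference patterns (peak positions are arbitrary).
templates = {
    "A": toy_pattern([5, 20]),
    "B": toy_pattern([10, 30]),
    "C": toy_pattern([15, 35]),
    "D": toy_pattern([25, 37]),
}

# Toy "double vowel": channel-wise sum of the A and B patterns.
double_vowel = [a + b for a, b in zip(templates["A"], templates["B"])]

uniform_probs = response_probabilities(double_vowel, templates, uniform_weight)
peak_probs = response_probabilities(double_vowel, templates, peak_emphasis)

for name in templates:
    print(name, round(uniform_probs[name], 3), round(peak_probs[name], 3))
```

In this toy setting the uniformly weighted comparison gives every template a sizeable share of the probability, because the shared floor dominates the similarity, whereas the peak-emphasizing comparison concentrates it on the two constituent vowels. The sketch covers only the individual-vowel predictions, not the pairwise ones.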