Vowel formant discrimination for high-fidelity speech.

Author information

Liu Chang, Kewley-Port Diane

Affiliation

Department of Speech and Hearing Sciences, Indiana University, Bloomington, Indiana 47405, USA.

Publication information

J Acoust Soc Am. 2004 Aug;116(2):1224-33. doi: 10.1121/1.1768958.

DOI:10.1121/1.1768958

PMID:15376687

Abstract

The goal of this study was to establish the ability of normal-hearing listeners to discriminate formant frequency in vowels in everyday speech. Vowel formant discrimination in syllables, phrases, and sentences was measured for high-fidelity (nearly natural) speech synthesized by STRAIGHT [Kawahara et al., Speech Commun. 27, 187-207 (1999)]. Thresholds were measured for changes in F1 and F2 for the vowels /ɪ, ɛ, æ, ʌ/ in /bVd/ syllables. Experimental factors manipulated included phonetic context (syllables, phrases, and sentences), sentence discrimination with the addition of an identification task, and word position. Results showed that neither longer phonetic context nor the addition of the identification task significantly affected thresholds, while thresholds for word final position showed significantly better performance than for either initial or middle position in sentences. Results suggest that an average of 0.37 barks is required for normal-hearing listeners to discriminate vowel formants in modest length sentences, elevated by 84% compared to isolated vowels. Vowel formant discrimination in several phonetic contexts was slightly elevated for STRAIGHT-synthesized speech compared to formant-synthesized speech stimuli reported in the study by Kewley-Port and Zheng [J. Acoust. Soc. Am. 106, 2945-2958 (1999)]. These elevated thresholds appeared related to greater spectral-temporal variability for high-fidelity speech produced by STRAIGHT than for formant-synthesized speech.
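As a rough illustration of the figures quoted above, the sketch below backs out the isolated-vowel threshold implied by "0.37 barks, elevated by 84%" and converts a Bark-scale difference into hertz at a given formant frequency. The abstract does not say which Hz-to-Bark formula the authors used; the Traunmüller (1990) approximation is assumed here, and the function names and the 500 Hz example formant are hypothetical.

```python
# Figures quoted in the abstract.
SENTENCE_THRESHOLD_BARK = 0.37  # average threshold in sentences
ELEVATION = 0.84                # sentences are 84% above isolated vowels


def implied_isolated_threshold(sentence_bark: float, elevation: float) -> float:
    """Isolated-vowel threshold implied by a relative elevation."""
    return sentence_bark / (1.0 + elevation)


def hz_to_bark(f_hz: float) -> float:
    """Traunmueller (1990) approximation (an assumption, not from the paper)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def bark_to_hz(z: float) -> float:
    """Algebraic inverse of hz_to_bark."""
    return 1960.0 * (z + 0.53) / (26.28 - z)


def delta_hz_at(f_hz: float, delta_bark: float) -> float:
    """Frequency change in Hz equal to delta_bark above f_hz."""
    return bark_to_hz(hz_to_bark(f_hz) + delta_bark) - f_hz


if __name__ == "__main__":
    isolated = implied_isolated_threshold(SENTENCE_THRESHOLD_BARK, ELEVATION)
    print(f"Implied isolated-vowel threshold: {isolated:.2f} barks")
    # 500 Hz is a hypothetical F1 value chosen only for illustration.
    print(f"0.37 barks above 500 Hz: {delta_hz_at(500.0, 0.37):.1f} Hz")
```

Under these assumptions the isolated-vowel threshold works out to about 0.20 barks (0.37 / 1.84). The hertz value depends on the Bark approximation chosen and on the formant frequency, so it should be read as indicative only.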
