Phonetics Lab, Linguistics Department, University of California-Davis, Davis, California 95616, USA.
Laboratoire de Phonétique et Phonologie, Université Sorbonne Nouvelle, UMR 7018 CNRS, Paris, France.
J Acoust Soc Am. 2024 Jul 1;156(1):489-502. doi: 10.1121/10.0027932.
Anticipatory coarticulation is a highly informative cue to upcoming linguistic information: listeners can identify that a word is ben and not bed from hearing the vowel alone. The present study compares the relative performance of human listeners and a self-supervised pre-trained speech model (wav2vec 2.0) in using nasal coarticulation to classify vowels. Stimuli consisted of nasalized (from CVN words) and non-nasalized (from CVC words) American English vowels produced by 60 human talkers and generated in 36 TTS voices. In aggregate, wav2vec 2.0 performance is similar to human listener performance. Broken down by vowel type, both wav2vec 2.0 and listeners classify non-nasalized vowels produced naturally by humans more accurately. For TTS voices, however, wav2vec 2.0 shows higher correct classification for nasalized vowels than for non-nasalized vowels. Speaker-level patterns reveal that listeners' use of coarticulation is highly variable across talkers; wav2vec 2.0 likewise shows cross-talker variability in performance. Analyses also reveal differences between listeners and wav2vec 2.0 in their use of multiple acoustic cues when classifying nasalized vowels. Findings have implications for understanding how coarticulatory variation is used in speech perception, and results provide insight into how neural systems learn to attend to the unique acoustic features of coarticulation.