Clonan Alex C, Zhai Xiu, Stevenson Ian H, Escabí Monty A
Electrical and Computer Engineering, University of Connecticut, Storrs, CT 06269.
Biomedical Engineering, University of Connecticut, Storrs, CT 06269.
bioRxiv. 2024 Oct 4:2024.02.13.579526. doi: 10.1101/2024.02.13.579526.
Recognizing speech in noise, such as in a busy restaurant, is an essential cognitive skill whose difficulty varies across environments and noise levels. Although there is growing evidence that the auditory system relies on statistical representations for perceiving and coding natural sounds, it is less clear how statistical cues and neural representations contribute to segregating speech in natural auditory scenes. We demonstrate that human listeners rely on mid-level statistics to segregate and recognize speech in environmental noise. Using natural backgrounds and variants with perturbed spectro-temporal statistics, we show that speech recognition accuracy at a fixed noise level varies extensively across natural backgrounds (from 0% to 100%). Furthermore, for each background, the unique interference created by summary statistics can mask or unmask speech, thus hindering or improving speech recognition. To identify the neural coding strategy and statistical cues that influence accuracy, we developed a framework that links summary statistics from a neural model to word recognition accuracy. Whereas a peripheral cochlear model accounts for only 60% of perceptual variance, summary statistics from a mid-level auditory midbrain model accurately predict single-trial sensory judgments, accounting for more than 90% of the perceptual variance. Furthermore, perceptual weights from the regression framework identify which statistics and tuned neural filters are influential and how they impact recognition. Thus, perception of speech in natural backgrounds relies on a mid-level auditory representation involving interference of multiple summary statistics that impact recognition beneficially or detrimentally across natural background sounds.
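The abstract describes a regression framework that maps summary statistics of a neural model's response onto single-trial word recognition judgments, with the fitted coefficients read as perceptual weights. The following is a minimal sketch of that idea, not the authors' pipeline: it assumes synthetic data, a logistic-regression link, and scikit-learn's LogisticRegression; the feature layout and all variable names are illustrative assumptions.

```python
# Hedged sketch: predict single-trial recognition (correct/incorrect) from
# summary-statistic features, then inspect coefficients as "perceptual
# weights". Data, shapes, and the logistic model are assumptions for
# illustration, not the paper's exact method.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical data: each trial is described by summary statistics
# (e.g., per-channel means, variances, correlations) computed from a
# mid-level auditory model's response to the background sound.
n_trials, n_stats = 500, 24
X = rng.normal(size=(n_trials, n_stats))      # summary-statistic features
true_w = rng.normal(size=n_stats)             # toy ground-truth weights
p_correct = 1 / (1 + np.exp(-(X @ true_w)))   # logistic link
y = rng.binomial(1, p_correct)                # 1 = word recognized

# Fit the statistics-to-judgment mapping; L2 regularization keeps weights
# stable when statistics are correlated across neural channels.
model = LogisticRegression(C=1.0, max_iter=1000).fit(X, y)

# Cross-validated accuracy: how well the statistics predict single-trial
# judgments (loosely analogous to "perceptual variance accounted for").
acc = cross_val_score(model, X, y, cv=5).mean()
print(f"cross-validated prediction accuracy: {acc:.2f}")

# "Perceptual weights": in this toy setup, sign and magnitude indicate
# whether a given statistic hinders (negative) or aids (positive) recognition.
weights = model.coef_.ravel()
print("most influential statistics:", np.argsort(-np.abs(weights))[:5])
```

In this framing, comparing the fit obtained with cochlear-model statistics against midbrain-model statistics would correspond to the 60% versus 90% variance comparison reported in the abstract.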