Ernst Strüngmann Institute (ESI) for Neuroscience in Cooperation with Max Planck Society, Frankfurt, Germany; University of Bristol, School of Psychological Science, Bristol, United Kingdom.
University of Bristol, School of Psychological Science, Bristol, United Kingdom.
Neural Netw. 2023 May;162:199-211. doi: 10.1016/j.neunet.2023.02.032. Epub 2023 Feb 24.
Natural and artificial audition can in principle acquire different solutions to a given problem. The constraints of the task, however, can nudge the cognitive science and engineering of audition to qualitatively converge, suggesting that a closer mutual examination would potentially enrich artificial hearing systems and process models of the mind and brain. Speech recognition - an area ripe for such exploration - is inherently robust in humans to a number of transformations at various spectrotemporal granularities. To what extent are these robustness profiles accounted for by high-performing neural network systems? We bring together experiments in speech recognition under a single synthesis framework to evaluate state-of-the-art neural networks as stimulus-computable, optimized observers. In a series of experiments, we (1) clarify how influential speech manipulations in the literature relate to each other and to natural speech, (2) show the granularities at which machines exhibit out-of-distribution robustness, reproducing classical perceptual phenomena in humans, (3) identify the specific conditions where model predictions of human performance differ, and (4) demonstrate a crucial failure of all artificial systems to perceptually recover where humans do, suggesting alternative directions for theory and model building. These findings encourage a tighter synergy between the cognitive science and engineering of audition.