Mirror Neurons and Interaction Lab, Robotics, Brain and Cognitive Sciences Department, Istituto Italiano di Tecnologia Genova, Italy.
Front Psychol. 2013 Jun 27;4:364. doi: 10.3389/fpsyg.2013.00364. Print 2013.
Classical models of speech consider an antero-posterior distinction between perceptive and productive functions. However, the selective alteration of neural activity in speech motor centers, via transcranial magnetic stimulation, has been shown to affect speech discrimination. On the automatic speech recognition (ASR) side, recognition systems have classically relied solely on acoustic data, achieving rather good performance in optimal listening conditions. The limitations of current ASR become most evident in the realistic use of such systems. These limitations can be partly reduced by using normalization strategies that minimize inter-speaker variability, either by explicitly removing speakers' peculiarities or by adapting different speakers to a reference model. In this paper we aim at modeling a motor-based imitation learning mechanism in ASR. We tested the utility of a speaker normalization strategy that uses motor representations of speech and compared it with strategies that ignore the motor domain. Specifically, we first trained a regressor through state-of-the-art machine learning techniques to build an auditory-motor mapping, in a sense mimicking a human learner who tries to reproduce utterances produced by other speakers. This auditory-motor mapping maps the speech acoustics of a speaker into the motor plans of a reference speaker. Since only speech acoustics are available during recognition, the mapping is necessary to "recover" motor information. Subsequently, in a phone classification task, we tested the system on either a speaker used during training or a new one. Results show that in both cases the motor-based speaker normalization strategy slightly but significantly outperforms all other strategies in which only acoustics is taken into account.
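The core computational idea (train a regressor from any speaker's acoustics to a reference speaker's motor plans, then recover motor features from acoustics alone at recognition time) can be sketched as follows. This is a minimal illustration with synthetic data and a plain linear least-squares regressor; the feature dimensions, noise model, and learner are assumptions, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the data (purely illustrative): per-frame
# acoustic features of a speaker, and the reference speaker's motor
# (articulatory) features for the same utterances.
n_frames, n_acoustic, n_motor = 500, 13, 6
W_true = rng.normal(size=(n_acoustic, n_motor))
acoustic = rng.normal(size=(n_frames, n_acoustic))
motor_ref = acoustic @ W_true + 0.1 * rng.normal(size=(n_frames, n_motor))

# Step 1: learn the auditory-motor mapping as a regressor from
# acoustics to the reference speaker's motor plans (the paper uses
# more sophisticated machine learning; least squares is a sketch).
W, *_ = np.linalg.lstsq(acoustic, motor_ref, rcond=None)

# Step 2: at recognition time only acoustics are available, so the
# mapping is applied to "recover" motor information for new frames.
acoustic_new = rng.normal(size=(100, n_acoustic))
motor_recovered = acoustic_new @ W

# A phone classifier would then be trained on the recovered motor
# representation instead of (or alongside) the raw acoustics.
print(motor_recovered.shape)
```

The recovered motor features serve as a speaker-normalized representation, since every speaker's acoustics are projected onto the same reference speaker's motor space before classification.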