Carlin Michael A, Elhilali Mounya
Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD 21218 USA.
IEEE/ACM Trans Audio Speech Lang Process. 2015 Dec;23(12):2422-2433. doi: 10.1109/TASLP.2015.2481179. Epub 2015 Sep 23.
A hallmark of sound processing in the brain is the ability of the nervous system to adapt to changing behavioral demands and surrounding soundscapes, dynamically shifting sensory and cognitive resources to focus on relevant sounds. Neurophysiological studies indicate that this ability is supported by adaptive retuning of the shapes of cortical spectro-temporal receptive fields (STRFs) to enhance features of target sounds while suppressing those of task-irrelevant distractors. Because an important component of human communication is a listener's ability to dynamically track speech in noisy environments, the solution suggested by auditory neurophysiology points to a useful adaptation strategy for speech activity detection (SAD). SAD is an important first step in many automated speech processing systems, and its performance often degrades in highly noisy environments. In this paper, we describe how task-driven adaptation is induced in an ensemble of neurophysiological STRFs, and show how speech-adapted STRFs reorient themselves to enhance spectro-temporal modulations of speech while suppressing those associated with a variety of nonspeech sounds. We then show that an adapted ensemble of STRFs detects speech in unseen noisy environments better than an unadapted ensemble and a noise-robust baseline. Finally, we use a stimulus reconstruction task to demonstrate that the adapted STRF ensemble better captures the spectro-temporal modulations of attended speech in clean and noisy conditions. Our results suggest that a biologically plausible adaptation framework can be applied to speech processing systems to dynamically adapt feature representations and improve noise robustness.
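For readers unfamiliar with STRFs, the following is a minimal sketch of how a single linear STRF is commonly applied to a spectrogram. It is not taken from the paper: the abstract does not specify the model, so the standard linear formulation r(t) = Σ_f Σ_τ h(f, τ) s(f, t − τ) is assumed here, and the function name `strf_response` and its array layout are hypothetical.

```python
import numpy as np

def strf_response(spectrogram, strf):
    """Linear STRF response r(t) = sum_f sum_tau strf[f, tau] * spectrogram[f, t - tau].

    Assumed layout (illustrative, not from the paper):
      spectrogram: shape (n_freq, n_time)
      strf:        shape (n_freq, n_lags), with lag 0 in column 0
    Samples before t = 0 are treated as zero (causal filtering).
    """
    spectrogram = np.asarray(spectrogram, dtype=float)
    strf = np.asarray(strf, dtype=float)
    n_freq, n_time = spectrogram.shape
    if strf.shape[0] != n_freq:
        raise ValueError("strf and spectrogram need the same number of frequency channels")

    response = np.zeros(n_time)
    for f in range(n_freq):
        # Convolve each frequency channel along time, keep the first n_time
        # samples, and accumulate the channels into one response.
        response += np.convolve(spectrogram[f], strf[f], mode="full")[:n_time]
    return response
```

Under this assumed model, an ensemble is simply a collection of such filters applied to the same spectrogram, and adaptation would correspond to changing the entries of each `strf` array. How those changes are driven by the task is the subject of the paper and is not sketched here.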