Bellur Ashwin, Elhilali Mounya
Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD 21218 USA.
IEEE/ACM Trans Audio Speech Lang Process. 2017 Mar;25(3):481-492. doi: 10.1109/TASLP.2016.2639322. Epub 2016 Dec 13.
Parsing natural acoustic scenes using computational methodologies poses many challenges. Given the rich and complex nature of the acoustic environment, data mismatch between train and test conditions is a major hurdle in data-driven audio processing systems. In contrast, the brain exhibits a remarkable ability to segment acoustic scenes with relative ease. When tackling the challenging listening conditions often faced in everyday life, the biological system relies on a number of principles that allow it to effortlessly parse its rich soundscape. In the current study, we leverage a key principle employed by the auditory system: its ability to adapt the neural representation of its sensory input in a high-dimensional space. We propose a framework that mimics this process in a computational model for robust speech activity detection. The system employs a 2-D Gabor filter bank whose parameters are retuned offline to improve the separability between the feature representations of speech and nonspeech sounds. This retuning process, driven by feedback from statistical models of the speech and nonspeech classes, attempts to minimize the misclassification risk of mismatched data with respect to the original statistical models. We hypothesize that this risk-minimization procedure emphasizes unique speech and nonspeech modulations in the high-dimensional space. We show that such an adapted system is indeed robust to novel conditions, with a marked reduction in equal error rates across a variety of databases with additive and convolutive noise distortions. We discuss the lessons learned from biology with regard to adapting to an ever-changing acoustic environment and their impact on building truly intelligent audio processing systems.
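The abstract does not specify the filter-bank implementation, but the core idea of filtering a time-frequency representation with a bank of 2-D Gabor filters tuned to different spectro-temporal modulations can be sketched roughly as follows. All parameter names, kernel sizes, and rate/scale values below are hypothetical illustrations, not the paper's actual configuration; in the proposed system these parameters would be the quantities retuned offline by the feedback procedure.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_2d(omega_t, omega_f, sigma_t=3.0, sigma_f=3.0, size=15):
    """Build one 2-D Gabor kernel: a Gaussian envelope multiplied by a
    cosine carrier oriented along a temporal rate (omega_t) and a
    spectral scale (omega_f). Parameterization is illustrative only."""
    half = size // 2
    t, f = np.meshgrid(np.arange(-half, half + 1),
                       np.arange(-half, half + 1))
    envelope = np.exp(-(t**2 / (2 * sigma_t**2) + f**2 / (2 * sigma_f**2)))
    carrier = np.cos(2 * np.pi * (omega_t * t + omega_f * f))
    return envelope * carrier

def gabor_features(spectrogram, rates=(0.05, 0.1), scales=(0.05, 0.1)):
    """Filter a (freq x time) spectrogram with a small bank of 2-D Gabor
    filters, yielding one high-dimensional feature map per (rate, scale)
    pair. Hypothetical rate/scale values."""
    return np.stack([
        convolve2d(spectrogram, gabor_2d(r, s), mode="same")
        for r in rates for s in scales
    ])

# Toy usage with a random stand-in for a magnitude spectrogram.
spec = np.abs(np.random.randn(64, 100))
feats = gabor_features(spec)
print(feats.shape)  # (4, 64, 100): one map per filter in the bank
```

A downstream classifier for speech activity detection would then operate on these modulation feature maps; the paper's offline retuning step would adjust the bank's parameters to improve speech/nonspeech separability in this space.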