Nemala Sridhar Krishna, Patil Kailash, Elhilali Mounya
The authors are with the Department of Electrical and Computer Engineering, Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD 21218 USA.
IEEE Trans Audio Speech Lang Process. 2013 Feb;21(2):416-426. doi: 10.1109/TASL.2012.2219526. Epub 2012 Sep 18.
Strong neurophysiological evidence suggests that the brain processes speech signals along parallel paths that encode complementary information in the signal. These parallel streams are organized around a duality of slow vs. fast: coarse signal dynamics appear to be processed separately from rapidly changing modulations, in both the spectral and temporal dimensions. We adapt this duality in a multistream framework for robust speaker-independent phoneme recognition. The scheme presented here centers on a multi-path bandpass modulation analysis of speech sounds, with each stream covering an entire range of temporal and spectral modulations. By performing bandpass operations along the spectral and temporal dimensions, the proposed scheme avoids the classic feature explosion problem of previous multistream approaches while maintaining the advantages of parallelism and localized feature analysis. The proposed architecture yields substantial improvements over standard and state-of-the-art feature schemes for phoneme recognition, particularly in the presence of nonstationary noise, reverberation, and channel distortions.
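The bandpass modulation analysis described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the filters, the spectrogram front end, or the number of streams, so everything below is an assumption made for illustration: the function name `modulation_stream`, the two-stream split, the band edges (4 Hz and 0.1 cycles per channel), and the use of ideal brick-wall masks in the 2-D Fourier domain of a time-frequency representation.

```python
import numpy as np

def modulation_stream(spec, frame_rate, rate_band, scale_band):
    """Keep only the modulations of `spec` (channels x frames) whose temporal
    modulation (rate, Hz) and spectral modulation (scale, cycles per channel)
    fall inside the given half-open bands [low, high)."""
    n_chan, n_frames = spec.shape
    # Absolute modulation frequency of every 2-D FFT bin.
    scales = np.abs(np.fft.fftfreq(n_chan))[:, None]
    rates = np.abs(np.fft.fftfreq(n_frames, d=1.0 / frame_rate))[None, :]
    mask = ((rates >= rate_band[0]) & (rates < rate_band[1]) &
            (scales >= scale_band[0]) & (scales < scale_band[1]))
    # The mask is symmetric in frequency, so the filtered output is real.
    return np.real(np.fft.ifft2(np.fft.fft2(spec) * mask))

# Synthetic input standing in for a spectrogram: 32 channels, 200 frames at 100 Hz.
frame_rate = 100.0
t = np.arange(200) / frame_rate
chan = np.arange(32)
# Slow part: 2 Hz temporal modulation, flat across channels.
slow_part = np.ones((32, 1)) * np.sin(2 * np.pi * 2.0 * t)[None, :]
# Fast part: 16 Hz temporal modulation with a 0.25 cycles/channel spectral ripple.
fast_part = (np.cos(2 * np.pi * 0.25 * chan)[:, None]
             * np.sin(2 * np.pi * 16.0 * t)[None, :])
spec = slow_part + fast_part

# Hypothetical band edges; each stream is a bandpass in both dimensions.
slow = modulation_stream(spec, frame_rate, (0.0, 4.0), (0.0, 0.1))
fast = modulation_stream(spec, frame_rate, (4.0, np.inf), (0.1, np.inf))
```

In this toy case each stream recovers one of the two synthetic components, and each stream has the same dimensionality as the input. A recognizer built along these lines would presumably train one classifier per stream and combine their outputs, but that stage is not shown here because the abstract gives no details about it.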