Nemala Sridhar Krishna, Patil Kailash, Elhilali Mounya
Department of Electrical and Computer Engineering, Center for Language and Speech Processing, Johns Hopkins University, 3400 N Charles Street, Barton Hall, Rm 105, Baltimore, MD USA.
Int J Speech Technol. 2013;16(3):313-322. doi: 10.1007/s10772-012-9184-y. Epub 2012 Dec 18.
Humans are quite adept at communicating in the presence of noise. However, most speech processing systems, such as automatic speech and speaker recognition systems, suffer a significant drop in performance when speech signals are corrupted by unseen background distortions. This work explores the use of a biologically motivated multi-resolution spectral analysis for speech representation. The approach focuses on the information-rich spectral attributes of speech and, through careful choice of model parameters, presents an intricate yet computationally efficient analysis of the speech signal. Further, the approach takes advantage of an information-theoretic analysis of the message-dominant and speaker-dominant regions in the speech signal, and defines feature representations to address two distinct tasks: speech recognition and speaker recognition. The proposed analysis surpasses the standard Mel-Frequency Cepstral Coefficients (MFCC) and their enhanced variants (via mean subtraction, variance normalization, and time sequence filtering), and yields significant improvements over a state-of-the-art noise-robust feature scheme on both speech and speaker recognition tasks.