Irino Toshio, Patterson Roy D, Kawahara Hideki
Faculty of Systems Engineering, Wakayama University, Wakayama 640-8510, Japan
IEEE Trans Audio Speech Lang Process. 2006 Nov;14(6):2212-2221. doi: 10.1109/TASL.2006.872611.
We propose a new method to segregate concurrent speech sounds using an auditory version of a channel vocoder. The auditory representation of sound, referred to as an "auditory image," preserves fine temporal information, unlike conventional window-based processing systems. This makes it possible to segregate speech sources with an event-synchronous procedure. Fundamental frequency information is used to estimate the sequence of glottal pulse times for a target speaker and to suppress the glottal events of other speakers. The procedure leads to robust extraction of the target speech and effective segregation even when the signal-to-noise ratio is as low as 0 dB. Moreover, the segregation performance remains high when the speech contains jitter, or when the estimate of the fundamental frequency F0 is inaccurate. This contrasts with conventional comb-filter methods, where errors in F0 estimation produce a marked reduction in performance. We compared the new method to a comb-filter method using a cross-correlation measure and perceptual recognition experiments. The results suggest that the new method has the potential to supplant comb-filter and harmonic-selection methods for speech enhancement.
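The comb-filter baseline that the abstract contrasts with can be illustrated with a minimal sketch. This is a hypothetical, simplified feedforward comb filter, not the paper's implementation: a delay of one F0 period is added to the signal, so components at harmonics of F0 add constructively while other components are attenuated. It also makes the abstract's point about F0 sensitivity easy to verify: a small error in the F0 estimate changes the delay and spoils the cancellation of non-harmonic interference.

```python
import numpy as np

def comb_filter(x, fs, f0, gain=1.0):
    """Feedforward comb filter reinforcing harmonics of f0 (illustrative sketch).

    y[n] = (x[n] + gain * x[n - T]) / (1 + gain), with T = round(fs / f0).
    Components at integer multiples of f0 are delayed by a whole number of
    periods and add in phase; other frequencies are partially cancelled.
    """
    T = int(round(fs / f0))          # delay of one fundamental period, in samples
    y = np.copy(x)
    y[T:] += gain * x[:-T]           # add the one-period-delayed signal
    return y / (1.0 + gain)          # normalize so harmonics pass at unit gain
```

For example, with fs = 16 kHz and F0 = 100 Hz, a 200-Hz component (the 2nd harmonic) passes essentially unchanged, while a 150-Hz component (not a harmonic of 100 Hz) is delayed by 1.5 of its periods and cancels. Re-running the 150-Hz case with the F0 estimate off by a few percent leaves a clearly audible residual, which is the failure mode the proposed event-synchronous method is claimed to avoid.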