Edraki Amin, Chan Wai-Yip, Jensen Jesper, Fogerty Daniel
Department of Electrical and Computer Engineering, Queen's University, Kingston, ON K7L 3N6, Canada.
Department of Electronic Systems, Aalborg University, 9220 Aalborg, Denmark.
IEEE/ACM Trans Audio Speech Lang Process. 2021;29:210-225. doi: 10.1109/taslp.2020.3039929. Epub 2020 Nov 24.
Spectro-temporal modulations are believed to mediate the analysis of speech sounds in the human primary auditory cortex. Inspired by humans' robustness in comprehending speech in challenging acoustic environments, we propose an intrusive speech intelligibility prediction (SIP) algorithm for normal-hearing listeners, the weighted spectro-temporal modulation index (wSTMI), based on spectro-temporal modulation analysis (STMA) of the clean and degraded speech signals. In the STMA, each of 55 modulation frequency channels contributes an intermediate intelligibility measure. A sparse linear model, with parameters optimized using Lasso regression, combines the intermediate measures of the 8 most salient channels for SIP. In comparison with a suite of 10 SIP algorithms, wSTMI performs consistently well across 13 datasets, which together cover degradation conditions including modulated noise, noise reduction processing, reverberation, near-end listening enhancement, and speech interruption. We show that the optimized parameters of wSTMI may be interpreted in terms of modulation transfer functions of the human auditory system. Thus, the proposed approach offers evidence affirming previous studies of the perceptual characteristics underlying speech signal intelligibility.
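The channel-selection step described in the abstract (Lasso regression reducing 55 per-channel measures to a sparse weighted sum) can be illustrated with a minimal sketch. This is not the authors' implementation: the data are synthetic Gaussian values rather than real per-channel intelligibility measures, the 8 "salient" channel indices, the noise level, and the penalty value are arbitrary assumptions chosen only to mirror the 55-to-8 setting, and the solver is a plain cyclic coordinate descent written for self-containment.

```python
import random

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the dead zone."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(columns, y, penalty, n_sweeps=100):
    """Minimise (1/2n)*||y - Xw||^2 + penalty*||w||_1 by cyclic coordinate descent.

    columns[j] holds the n values of predictor j (one hypothetical channel).
    """
    n = len(y)
    weights = [0.0] * len(columns)
    residual = list(y)  # equals y - Xw while all weights are zero
    mean_sq = [sum(v * v for v in col) / n for col in columns]
    for _ in range(n_sweeps):
        for j, col in enumerate(columns):
            if mean_sq[j] == 0.0:
                continue
            # Correlation of channel j with the residual that excludes channel j.
            rho = sum(x * r for x, r in zip(col, residual)) / n + mean_sq[j] * weights[j]
            new_weight = soft_threshold(rho, penalty) / mean_sq[j]
            delta = new_weight - weights[j]
            if delta != 0.0:
                residual = [r - x * delta for x, r in zip(col, residual)]
                weights[j] = new_weight
    return weights

# Synthetic stand-in data (assumption): 55 channels observed over 200 conditions.
rng = random.Random(0)
n_channels, n_conditions = 55, 200
true_channels = [3, 9, 14, 20, 27, 33, 41, 50]  # hypothetical salient channels
columns = [[rng.gauss(0.0, 1.0) for _ in range(n_conditions)]
           for _ in range(n_channels)]
# Synthetic "intelligibility" target: sum of the salient channels plus small noise.
scores = [sum(columns[j][i] for j in true_channels) + rng.gauss(0.0, 0.1)
          for i in range(n_conditions)]

weights = lasso_coordinate_descent(columns, scores, penalty=0.1)
selected = [j for j, w in enumerate(weights) if w != 0.0]
print("channels with non-zero weight:", selected)
```

The L1 penalty drives the weights of uninformative channels exactly to zero, which is what makes the resulting linear model sparse; a larger penalty keeps fewer channels, a smaller one keeps more.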