Modelling speech intelligibility in adverse conditions.

Affiliation

Department of Electrical Engineering, Technical University of Denmark, Lyngby, Denmark.

Publication information

Adv Exp Med Biol. 2013;787:343-51. doi: 10.1007/978-1-4614-1590-9_38.

Abstract

Jørgensen and Dau (J Acoust Soc Am 130:1475-1487, 2011) proposed the speech-based envelope power spectrum model (sEPSM) in an attempt to overcome the limitations of the classical speech transmission index (STI) and speech intelligibility index (SII) in conditions with nonlinearly processed speech. Instead of using the reduction of the temporal modulation energy as the intelligibility metric, as assumed in the STI, the sEPSM applies the signal-to-noise ratio in the envelope domain (SNRenv). This metric was shown to be the key for predicting the intelligibility of reverberant speech as well as noisy speech processed by spectral subtraction. The key role of the SNRenv metric is further supported here by the ability of a short-term version of the sEPSM to predict speech masking release for different speech materials and modulated interferers. However, the sEPSM cannot account for speech subjected to phase jitter, a condition in which the spectral structure of the speech signal is strongly affected, while the broadband temporal envelope is kept largely intact. In contrast, the effects of this distortion can be predicted successfully by the spectro-temporal modulation index (STMI) (Elhilali et al., Speech Commun 41:331-348, 2003), which assumes an explicit analysis of the spectral "ripple" structure of the speech signal. However, since the STMI applies the same decision metric as the STI, it fails to account for spectral subtraction. The results from this study suggest that the SNRenv might reflect a powerful decision metric, while some explicit across-frequency analysis seems crucial in some conditions. How such across-frequency analysis is "realized" in the auditory system remains unresolved.
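To make the idea of a signal-to-noise ratio in the envelope domain more concrete, the following is a minimal sketch, not the published sEPSM. It assumes a single broadband channel and a single rectangular modulation band, whereas the model itself uses auditory and modulation filterbanks. The function names, the 2-8 Hz band, the test signals, and the envelope power floor of 0.01 are all illustrative assumptions. The speech envelope power is estimated as the envelope power of the noisy mixture minus that of the noise alone, and the result is divided by the noise envelope power.

```python
import numpy as np

def hilbert_envelope(x):
    """Magnitude of the analytic signal, via an FFT-based Hilbert transform."""
    n = len(x)
    spectrum = np.fft.fft(x)
    weights = np.zeros(n)
    weights[0] = 1.0                      # keep DC
    if n % 2 == 0:
        weights[n // 2] = 1.0             # keep Nyquist
        weights[1:n // 2] = 2.0           # double positive frequencies
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(spectrum * weights))

def envelope_power(x, fs, f_lo, f_hi):
    """AC envelope power in [f_lo, f_hi] Hz, normalized by the DC envelope power."""
    env = hilbert_envelope(x)
    n = len(env)
    spec = np.fft.rfft(env) / n
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    ac_power = 2.0 * np.sum(np.abs(spec[band]) ** 2)  # one-sided spectrum
    dc_power = np.abs(spec[0]) ** 2
    return ac_power / dc_power

def snr_env(mixture, noise, fs, f_lo=2.0, f_hi=8.0, floor=0.01):
    """Envelope-domain SNR (linear). `floor` is an assumed lower limit."""
    p_mix = envelope_power(mixture, fs, f_lo, f_hi)
    p_noise = max(envelope_power(noise, fs, f_lo, f_hi), floor)
    p_speech = max(p_mix - p_noise, floor)  # estimated speech envelope power
    return p_speech / p_noise

# Hypothetical test signals: a 4 Hz amplitude-modulated noise carrier stands
# in for speech, and stationary Gaussian noise acts as the interferer.
fs = 8000
t = np.arange(2 * fs) / fs
rng = np.random.default_rng(0)
target = (1.0 + 0.8 * np.sin(2 * np.pi * 4.0 * t)) * rng.standard_normal(t.size)
noise = rng.standard_normal(t.size)

snr_with_target = snr_env(target + noise, noise, fs)
snr_noise_only = snr_env(noise, noise, fs)
print(10 * np.log10(snr_with_target), 10 * np.log10(snr_noise_only))
```

In this sketch the mixture containing the modulated target yields a higher value than the noise alone, because the target adds envelope fluctuations in the analysed modulation band. Since only the broadband envelope is analysed, a distortion that leaves that envelope largely intact would barely change the value, which is in line with the limitation described above for phase jitter.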
