Fraunhofer IDMT, Hearing, Speech and Audio Technology and Cluster of Excellence Hearing4all, Marie-Curie-Str. 2, 26129 Oldenburg, Germany.
Department für Medizinische Physik und Akustik, Carl von Ossietzky Universität Oldenburg and Cluster of Excellence Hearing4all, Oldenburg, Germany.
Hear Res. 2022 Dec;426:108598. doi: 10.1016/j.heares.2022.108598. Epub 2022 Aug 8.
Speech perception is strongly affected by noise and reverberation in the listening room, and binaural processing can substantially facilitate speech perception in conditions in which target speech and maskers originate from different directions. Most studies and proposed models for predicting spatial unmasking have focused on speech intelligibility. The present study introduces a model framework that predicts both speech intelligibility and perceived listening effort from the same output measure. The framework combines a blind binaural frontend employing a blind equalization-cancellation (EC) mechanism with a blind backend based on phoneme probability classification. Neither the frontend nor the backend requires any additional information, such as the source directions, the signal-to-noise ratio (SNR), or the number of sources, allowing for a fully blind perceptual assessment of binaural input signals consisting of target speech mixed with noise. The model is validated against a recent data set in which speech intelligibility and perceived listening effort were measured for a range of acoustic conditions differing in reverberation and binaural cues [Rennies and Kidd (2018), J. Acoust. Soc. Am. 144, 2147-2159]. Predictions of the proposed model are compared with those of a non-blind binaural model consisting of a non-blind EC stage and a backend based on the speech intelligibility index. The analyses indicated that the blind model correctly predicted all main trends observed in the experiments. The overall proportion of variance explained by the blind model for speech intelligibility (R² = 0.94) was slightly lower than that of the non-blind model (R² = 0.98). For listening effort predictions, both models showed lower prediction accuracy but still explained substantial proportions of the observed variance (R² = 0.88 and R² = 0.71 for the non-blind and blind model, respectively).
Closer inspection showed that the differences between data and predictions were largest for binaural conditions at high SNRs, where the models, and in particular the blind version, tended to underestimate the perceived listening effort of human listeners.
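The blind EC frontend described above builds on the classical equalization-cancellation principle: one ear's signal is delayed and scaled so that the masker components at the two ears align, and the signals are then subtracted, suppressing the masker while leaving a spatially separated target largely intact. The following is a minimal, hypothetical sketch of that principle only, not the paper's implementation; the delay and gain are assumed known here, whereas the model in the paper estimates its parameters blindly from the mixture:

```python
import numpy as np

def ec_cancel(left, right, delay, gain):
    """Equalization-cancellation sketch: delay and scale the right-ear
    signal so the masker components align with the left ear, then
    subtract to suppress the masker (delay in samples, gain linear)."""
    equalized = gain * np.roll(right, delay)
    return left - equalized

# Toy demo: a diotic masker (identical at both ears, i.e. ITD = 0 and
# ILD = 0 dB) and a target present only at the left ear.
rng = np.random.default_rng(0)
n = 1000
masker = rng.standard_normal(n)
target = np.sin(2 * np.pi * 0.01 * np.arange(n))
left = target + masker
right = masker.copy()

out = ec_cancel(left, right, delay=0, gain=1.0)
# With perfect equalization the diotic masker cancels and the
# spatially separated target survives.
print(np.allclose(out, target))  # True
```

In this idealized case cancellation is perfect; in practice, internal noise and imperfect (here: blind) estimation of the equalization parameters limit the achievable masker suppression, which is how EC-based models account for finite binaural unmasking.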