Department of Electrical Engineering, University of Texas at Dallas, Richardson, Texas 75080-3021, USA.
Ear Hear. 2011 May-Jun;32(3):331-8. doi: 10.1097/AUD.0b013e3181ff3515.
The purpose of this study is to evaluate the performance of a number of speech intelligibility indices in terms of predicting the intelligibility of vocoded speech.
Noise-corrupted sentences were vocoded in a total of 80 conditions, involving three signal-to-noise ratio levels (-5, 0, and 5 dB) and two types of maskers (steady-state noise and two-talker). Tone-vocoder simulations and combined electric-acoustic stimulation (EAS) simulations were used. The vocoded sentences were presented to normal-hearing listeners for identification, and the resulting intelligibility scores were used to assess how well various speech intelligibility measures correlated with listener performance. These included measures designed to assess speech intelligibility, such as the speech transmission index (STI) and articulation-index-based measures, as well as measures designed to assess distortions in hearing aids (e.g., coherence-based measures). These measures employed primarily either the temporal-envelope or the spectral-envelope information in the prediction model. The underlying hypothesis in the present study is that measures that assess temporal-envelope distortions, such as those based on the STI, should correlate highly with the intelligibility of vocoded speech. This is based on the fact that vocoder simulations preserve primarily envelope information, similar to the processing implemented in current cochlear implant speech processors. Similarly, it is hypothesized that measures that assess the distortions present in the spectral envelope, such as the coherence-based index, could also be used to model the intelligibility of vocoded speech.
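To make the tone-vocoder idea concrete, the following is a minimal sketch and not the processing used in the study. It splits a signal into bands, extracts each band's envelope, and uses that envelope to modulate a sine tone at the band center. The eight-channel default follows the channel count reported in this abstract; the function names, the logarithmic band spacing, the 80-6000 Hz range, the 400 Hz envelope cutoff, and the FFT brick-wall filters are all assumptions chosen only for illustration.

```python
import numpy as np


def fft_bandpass(x, fs, lo, hi):
    """Brick-wall filter: keep only spectral components with lo <= f < hi (Hz)."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spec[(freqs < lo) | (freqs >= hi)] = 0.0
    return np.fft.irfft(spec, n=len(x))


def tone_vocode(x, fs, n_channels=8, f_lo=80.0, f_hi=6000.0, env_cutoff=400.0):
    """Illustrative tone vocoder (all parameter defaults are assumptions).

    Each band's envelope replaces the band's fine structure with a sine
    carrier, so mostly envelope information survives.
    """
    x = np.asarray(x, dtype=float)
    # Hypothetical band layout: logarithmically spaced edges.
    edges = np.geomspace(f_lo, f_hi, n_channels + 1)
    t = np.arange(len(x)) / fs
    out = np.zeros_like(x)
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = fft_bandpass(x, fs, lo, hi)
        # Envelope: full-wave rectification followed by low-pass filtering.
        env = fft_bandpass(np.abs(band), fs, 0.0, env_cutoff)
        env = np.maximum(env, 0.0)  # clip small negative ripple from filtering
        fc = np.sqrt(lo * hi)       # geometric center of the band
        out += env * np.sin(2.0 * np.pi * fc * t)
    return out
```

A call such as `tone_vocode(signal, 16000)` returns a signal of the same length as the input. A realistic simulation would use causal or zero-phase IIR filters and match the output level to the input, both of which are omitted here.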
Of all the intelligibility measures considered, the coherence-based and the STI-based measures performed the best. High correlations (r = 0.9 to 0.96) were maintained with the coherence-based measures in all noisy conditions. The highest correlation obtained with the STI-based measure was r = 0.92, obtained when high modulation rates (100 Hz) were used. The performance of these measures remained high in both steady-state noise and fluctuating-masker conditions. The correlations for conditions involving tone-vocoded speech were slightly higher than those for conditions involving EAS-vocoded speech.
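The r values reported above are correlation coefficients between an objective measure's predictions and the listeners' intelligibility scores across conditions. As a minimal sketch, assuming a plain Pearson correlation is the statistic of interest, the computation looks like the following. The function name `pearson_r` is hypothetical and not taken from the study.

```python
import math


def pearson_r(predicted, observed):
    """Pearson correlation between per-condition predictions and scores."""
    if len(predicted) != len(observed) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    # Sums of cross-products and squared deviations about the means.
    s_po = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    s_pp = sum((p - mean_p) ** 2 for p in predicted)
    s_oo = sum((o - mean_o) ** 2 for o in observed)
    if s_pp == 0.0 or s_oo == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return s_po / math.sqrt(s_pp * s_oo)
```

In use, `predicted` would hold one value of the objective measure per test condition and `observed` the mean percent-correct score for the same condition, in the same order.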
The present study demonstrated that some of the speech intelligibility indices previously found to correlate highly with the intelligibility of wideband speech can also be used to predict the intelligibility of vocoded speech. Both the coherence-based and STI-based measures were found to be good measures for modeling the intelligibility of vocoded speech. The highest correlation (r = 0.96) was obtained with a derived coherence measure that placed more emphasis on information contained in vowel/consonant spectral transitions and less emphasis on information contained in steady sonorant segments. High (100 Hz) modulation rates were found to be necessary in the implementation of the STI-based measures for better modeling of the intelligibility of vocoded speech. We believe that the difference in modulation rates needed for modeling the intelligibility of wideband versus vocoded speech can be attributed to the increased importance of higher modulation rates in situations where the amount of spectral information available to the listeners is limited (eight channels in our study). Unlike the traditional STI method, which has been found to perform poorly in predicting the intelligibility of processed speech involving nonlinear operations, the STI-based measure used in the present study performed quite well. In summary, the present study took the first step in modeling the intelligibility of vocoded speech. Access to such intelligibility measures is of high significance, as they can be used to guide the development of new speech coding algorithms for cochlear implants.