Mamun Nursadul, Ghosh Ria, Hansen John H L
CRSS: Center for Robust Speech Systems; Cochlear Implant Processing Laboratory (CILab) Department of Electrical and Computer Engineering, University of Texas at Dallas, USA.
Interspeech. 2019 Sep;2019:3118-3122. doi: 10.21437/interspeech.2019-1852.
Speaker recognition is a biometric modality that uses underlying speech information to determine the identity of a speaker. Speaker identification (SID) under noisy conditions is one of the challenging topics in the field of speech processing, particularly for individuals with cochlear implants (CIs). This study analyzes and quantifies the ability of CI users to perform speaker identification based on direct electric auditory stimuli. CI users rely on a limited number of frequency bands (8–22) and on electrodes that directly stimulate the basilar membrane/cochlea to perceive the speech signal. The sparsity of electric stimulation across the CI frequency range is a prime reason for loss in human speech recognition as well as SID performance. It is therefore hypothesized that CI users may be unable to recognize and distinguish speakers from speaker-dependent cues such as formant frequencies and pitch, which are lost to unstimulated electrodes. To test this hypothesis, the input speech signal is processed with the CI Advanced Combination Encoder (ACE) signal processing strategy to construct the CI auditory electrodogram. The proposed study uses 50 speakers from each of three different databases; the system is trained with two different classifiers under quiet conditions and tested under both quiet and noisy conditions. The objective results show that CI users can effectively identify a limited number of speakers; however, performance decreases as more speakers are added to the system and when noisy conditions are introduced. This information could therefore be used to improve CI signal processing techniques and, in turn, human SID.
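The pipeline described above can be sketched in a few lines of NumPy. The sketch below is illustrative only: it uses a crude FFT filterbank with n-of-m maxima selection to mimic the sparse electrodogram an ACE-like strategy produces (the clinical ACE strategy involves additional stages such as envelope compression and pulse mapping that are omitted here), and a nearest-template cosine-similarity classifier stands in for the paper's classifiers, whose identity the abstract does not specify. All function names and parameter values are hypothetical.

```python
import numpy as np

def ace_electrodogram(signal, sr=16000, n_channels=22, n_maxima=8,
                      frame_len=128, hop=64):
    """Toy ACE-like n-of-m strategy: per frame, keep only the n_maxima
    strongest of n_channels filterbank energies, yielding the sparse
    channel-by-time electrodogram discussed in the abstract."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Equal-width FFT-bin channel edges (real ACE uses quasi-log spacing).
    edges = np.linspace(0, frame_len // 2, n_channels + 1, dtype=int)
    gram = np.zeros((n_channels, n_frames))
    for t in range(n_frames):
        frame = signal[t * hop : t * hop + frame_len] * np.hanning(frame_len)
        spec = np.abs(np.fft.rfft(frame))
        env = np.array([spec[edges[c]:edges[c + 1]].sum()
                        for c in range(n_channels)])
        keep = np.argsort(env)[-n_maxima:]   # n-of-m maxima selection
        sparse = np.zeros(n_channels)
        sparse[keep] = env[keep]
        gram[:, t] = sparse
    return gram

def identify(train_grams, test_gram):
    """Closed-set SID by cosine similarity between mean channel-energy
    profiles (an illustrative stand-in for the paper's classifiers)."""
    probe = test_gram.mean(axis=1)
    best, best_sim = None, -1.0
    for speaker, gram in train_grams.items():
        ref = gram.mean(axis=1)
        sim = ref @ probe / (np.linalg.norm(ref) * np.linalg.norm(probe) + 1e-12)
        if sim > best_sim:
            best, best_sim = speaker, sim
    return best
```

As a sanity check, two synthetic "speakers" with energy concentrated at different frequencies land on different electrodes and are separated perfectly; real speech, of course, overlaps far more, which is consistent with the abstract's finding that accuracy drops as speakers and noise are added.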