IEEE Trans Cybern. 2017 Dec;47(12):4235-4249. doi: 10.1109/TCYB.2016.2603146. Epub 2016 Sep 19.
Speaker identification plays a crucial role in biometric person identification, as systems based on human speech are increasingly used to recognize people. Mel frequency cepstral coefficients (MFCCs) have been widely adopted for decades in speech processing to capture speech-specific characteristics with reduced dimensionality. However, although their ability to decorrelate the vocal source and the vocal tract filter makes them suitable for speech recognition, they greatly attenuate speaker variability, the very characteristic that distinguishes different speakers. This paper presents a theoretical framework and an experimental evaluation showing that reducing the dimension of the features by applying the discrete Karhunen-Loève transform (DKLT) to the log-spectrum of the speech signal guarantees better performance than conventional MFCC features. In particular, with short sequences of speech frames, typically lasting less than 2 s, the performance of the truncated DKLT representation in identifying five speakers was always better than that achieved with MFCCs in the experiments we performed. Additionally, the framework was tested on up to 100 TIMIT speakers with sequences of less than 3.5 s, showing very good recognition capabilities.
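The feature extraction described in the abstract, a truncated DKLT applied to the log-spectrum of speech frames, can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything specific here is an assumption: the frame length, hop size, Hann window, number of retained components, estimating the basis from a single pooled covariance matrix, and the function names `log_spectrum_frames`, `fit_dklt`, and `dklt_features`. Random noise stands in for a real speech signal.

```python
import numpy as np

def log_spectrum_frames(signal, frame_len=256, hop=128):
    """Split a 1-D signal into Hann-windowed frames and return their
    log-magnitude spectra (frame_len and hop are assumed values)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    # Small offset avoids log(0) on silent bins.
    return np.log(spectrum + 1e-10)

def fit_dklt(X, n_components):
    """Estimate a truncated DKLT basis from the rows of X.

    The basis consists of the eigenvectors of the sample covariance
    matrix with the largest eigenvalues."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, eigvecs[:, order]

def dklt_features(X, mean, basis):
    """Project centered log-spectra onto the truncated DKLT basis."""
    return (X - mean) @ basis

# Stand-in signal: one second of noise at an assumed 16 kHz sampling rate.
rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)

X = log_spectrum_frames(signal)
mean, basis = fit_dklt(X, n_components=12)
features = dklt_features(X, mean, basis)
```

Each row of `features` is a reduced-dimension representation of one frame. Unlike the fixed cosine basis used for MFCCs, the DKLT basis is estimated from the data, which is one plausible reading of why it could retain more speaker-specific variability; the abstract itself does not spell out the mechanism at this level of detail.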