School of Computer Science and Engineering, Vellore Institute of Technology, Chennai, Tamil Nadu, 600127, India.
Department of Information Technology, Sri Sivasubramaniya Nadar College of Engineering, Kalavakkam, Tamil Nadu, 603110, India.
Sci Rep. 2024 May 31;14(1):12513. doi: 10.1038/s41598-024-62406-3.
Speech is produced by a nonlinear, dynamical vocal tract (VT) system and is transmitted through multiple modes (air, bone and skin conduction), which are captured by air, bone and throat microphones, respectively. Speaker-specific characteristics that capture this nonlinearity are rarely used as stand-alone features for speaker modeling; at best, they have been used in tandem with well-known linear spectral features to produce tangible results. This paper proposes Recurrence Plot (RP) embeddings as stand-alone, nonlinear speaker-discriminating features. Two datasets, the continuous multimodal TIMIT speech corpus and the consonant-vowel unimodal syllable dataset, are used to conduct closed-set speaker identification experiments. Experiments with unimodal speaker recognition systems show that RP embeddings capture the nonlinear dynamics of the VT system, which are unique to every speaker, in all modes of speech. The Air (A), Bone (B) and Throat (T) microphone systems, trained purely on RP embeddings, achieve accuracies of 95.81%, 98.18% and 99.74%, respectively. Experiments using the joint feature space of combined RP embeddings for bimodal (A-T, A-B, B-T) and trimodal (A-B-T) systems show that the best trimodal system (99.84% accuracy) performs on par with trimodal systems using spectrogram (99.45%) and MFCC (99.98%) features. The 98.84% accuracy of the B-T bimodal system shows the efficacy of a speaker recognition system based entirely on alternate (bone and throat) speech, in the absence of standard (air) speech. The results underscore the significance of the RP embedding as a nonlinear feature representation of the dynamical VT system that can act independently for speaker recognition. It is envisaged that speech recognition will also benefit from this nonlinear feature.
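For readers unfamiliar with recurrence plots, the following is a minimal sketch of how one can be computed from a short signal segment. It uses the generic textbook construction (time-delay embedding followed by a thresholded pairwise-distance matrix) and is not taken from the paper. The function names `delay_embed` and `recurrence_plot`, the parameters `dim`, `delay` and `eps`, the default threshold heuristic, and the synthetic sine segment are all illustrative assumptions. The abstract does not state how the embeddings are derived from the plots, so that step is not shown.

```python
import numpy as np


def delay_embed(x, dim=3, delay=2):
    """Time-delay embedding (hypothetical helper).

    Row i of the result is (x[i], x[i + delay], ..., x[i + (dim - 1) * delay]).
    """
    x = np.asarray(x, dtype=float)
    n = len(x) - (dim - 1) * delay
    if n <= 0:
        raise ValueError("segment too short for the chosen dim and delay")
    return np.stack([x[k * delay : k * delay + n] for k in range(dim)], axis=1)


def recurrence_plot(x, dim=3, delay=2, eps=None):
    """Binary recurrence matrix: R[i, j] = 1 if states i and j are within eps."""
    states = delay_embed(x, dim, delay)
    # Pairwise Euclidean distances between all reconstructed states.
    diff = states[:, None, :] - states[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    if eps is None:
        # Assumed heuristic: 10% of the largest pairwise distance.
        eps = 0.1 * dist.max()
    return (dist <= eps).astype(np.uint8)


# Usage on a synthetic stand-in for a short speech segment (assumption:
# 400 samples of a 200 Hz sine sampled at 8 kHz).
t = np.arange(400) / 8000.0
segment = np.sin(2 * np.pi * 200 * t)
rp = recurrence_plot(segment)
print(rp.shape)
```

The resulting matrix is square and symmetric with a main diagonal of ones, since every state recurs with itself. Periodic structure in the signal shows up as lines parallel to that diagonal, which is the kind of dynamical signature a downstream model could learn from.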