Bawa Puneet, Kadyan Virender, Tripathy Abinash, Singh Thipendra P
Centre of Excellence for Speech and Multimodal Laboratory, Chitkara University Institute of Engineering and Technology, Chitkara University, Punjab, India.
Speech and Language Research Centre, School of Computer Science, University of Petroleum and Energy Studies (UPES), Energy Acres, Bidholi, Dehradun, Uttarakhand 248007 India.
Complex Intell Systems. 2023;9(1):1-23. doi: 10.1007/s40747-022-00651-7. Epub 2022 Jun 2.
Development of a robust native-language ASR framework is very challenging and remains an active area of research. Effective front-end as well as back-end approaches must be investigated to tackle environmental differences, large training complexity, and inter-speaker variability in building a successful recognition system. In this paper, four front-end approaches have been investigated to generate unique and robust feature vectors at different SNR values: mel-frequency cepstral coefficients (MFCC), gammatone frequency cepstral coefficients (GFCC), relative spectral perceptual linear prediction (RASTA-PLP), and power-normalized cepstral coefficients (PNCC). Furthermore, to handle the large training-data complexity, parameter optimization has been performed with sequence-discriminative training techniques: maximum mutual information (MMI), minimum phone error (MPE), boosted MMI (bMMI), and state-level minimum Bayes risk (sMBR). This is demonstrated by selecting optimal parameter values through lattice generation and by adjusting learning rates. In the proposed framework, four different systems have been tested by analyzing the feature-extraction approaches (with or without speaker normalization of the test set through Vocal Tract Length Normalization (VTLN)) and the classification strategy, with or without artificial extension of the training dataset. To compare system performance, matched (adult train and test, S1; child train and test, S2) and mismatched (adult train and child test, S3; adult + child train and child test, S4) systems have been demonstrated on a large adult and a very small child Punjabi clean-speech corpus. Finally, gender-based in-domain data augmentation is used to moderate acoustic and phonetic variation between adult and children's speech under mismatched conditions.
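To make the front-end stage concrete, the following is a minimal pure-NumPy sketch of the classic MFCC pipeline (pre-emphasis, framing, windowing, power spectrum, mel filterbank, log, DCT). It is not the authors' implementation; the sampling rate, frame length, hop, and filterbank sizes are illustrative defaults, and production systems would typically use a toolkit such as Kaldi or librosa.

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel-scale conversion.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, frame_len=400, hop=160,
         n_mels=26, n_ceps=13):
    """Minimal MFCC sketch; parameters are illustrative, not from the paper."""
    # Pre-emphasis boosts high frequencies.
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Slice into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(sig) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(frame_len)
    # Per-frame power spectrum.
    pspec = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank spanning 0 .. sr/2.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):
            fbank[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[m - 1, k] = (right - k) / max(right - centre, 1)
    # Log mel energies, then DCT-II to decorrelate; keep first n_ceps.
    logmel = np.log(pspec @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2.0 * n_mels)))
    return logmel @ dct.T
```

VTLN can be grafted onto the same pipeline by warping the filterbank centre frequencies with a per-speaker factor before building `fbank`, which is why the paper treats it as a test-set normalization step rather than a separate feature type.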
The experimental results show that an effective framework built on the PNCC + VTLN front-end with a TDNN-sMBR-based model and parameter optimization yields relative improvements (RI) of 40.18%, 47.51%, and 49.87% in the matched, mismatched, and gender-based in-domain augmented systems, respectively, under typical clean and noisy conditions.
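For readers unfamiliar with the metric, relative improvement here is the fractional reduction in error rate with respect to a baseline. A quick illustration (the WER values below are hypothetical, not taken from the paper):

```python
def relative_improvement(wer_baseline, wer_system):
    """Relative reduction in word error rate, as a percentage."""
    return 100.0 * (wer_baseline - wer_system) / wer_baseline

# A hypothetical baseline WER of 20% reduced to 12% is a 40% RI:
print(relative_improvement(20.0, 12.0))  # 40.0
```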