Shivakumar Prashanth Gurunath, Georgiou Panayiotis
Signal Processing for Communication Understanding & Behavior Analysis (SCUBA) Lab, University of Southern California, Los Angeles, California, USA.
Comput Speech Lang. 2020 Sep;63. doi: 10.1016/j.csl.2020.101077. Epub 2020 Feb 18.
Children's speech recognition is challenging mainly because of the inherently high variability in children's physical and articulatory characteristics and expressions. This variability manifests in both acoustic constructs and linguistic usage, owing to the rapidly changing developmental stages of childhood. Part of the challenge is the lack of large amounts of children's speech data available for efficient modeling. This work attempts to address these key challenges using transfer learning from adult models to children's models in a Deep Neural Network (DNN) framework for children's Automatic Speech Recognition (ASR), evaluated on multiple large-vocabulary children's speech corpora. The paper presents a systematic and extensive analysis of the proposed transfer learning technique, considering the key factors that prior literature identifies as affecting children's speech recognition. Evaluations are presented on (i) comparisons of earlier GMM-HMM models and the newer DNN models, (ii) the effectiveness of standard adaptation techniques versus transfer learning, and (iii) various adaptation configurations for tackling the variabilities present in children's speech, in terms of (a) acoustic spectral variability and (b) pronunciation variability and linguistic constraints. Our analysis spans (i) the number of DNN model parameters (for adaptation), (ii) the amount of adaptation data, (iii) the ages of the children, and (iv) age-dependent versus age-independent adaptation. Finally, we provide recommendations on (i) the favorable strategies across the analyzed parameters above, and (ii) potential future research directions and the relevant challenges that persist in DNN-based ASR for children's speech.
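As a rough illustration of the transfer learning idea in the abstract above (start from a model trained on adult speech, then update only a chosen subset of its parameters on a small amount of children's data), a minimal sketch follows. It is not the authors' system: the two-layer network, the weight values, the synthetic "child" data, and the names `forward`, `mean_squared_error`, and `adapt` are all hypothetical, chosen only to show how freezing some layers and adapting others can be expressed.

```python
import math


def forward(params, x):
    """Toy two-layer network: hidden = tanh(W1 x), output = W2 . hidden."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)))
              for row in params["W1"]]
    output = sum(w * h for w, h in zip(params["W2"], hidden))
    return hidden, output


def mean_squared_error(params, data):
    """Average squared error of the model over (input, target) pairs."""
    return sum((forward(params, x)[1] - y) ** 2 for x, y in data) / len(data)


def adapt(adult_params, child_data, trainable, lr=0.05, epochs=200):
    """Fine-tune a copy of the adult model with plain SGD.

    Only the layers named in `trainable` are updated; all other layers
    keep their adult-model values (they are "frozen").
    """
    params = {"W1": [row[:] for row in adult_params["W1"]],
              "W2": adult_params["W2"][:]}
    for _ in range(epochs):
        for x, y in child_data:
            hidden, output = forward(params, x)
            err = output - y  # gradient of 0.5 * err**2 w.r.t. the output
            # Compute both gradients before touching any weights.
            grad_w2 = [err * h for h in hidden]
            grad_w1 = [[err * w2 * (1.0 - h * h) * xi for xi in x]
                       for w2, h in zip(params["W2"], hidden)]
            if "W2" in trainable:
                params["W2"] = [w - lr * g
                                for w, g in zip(params["W2"], grad_w2)]
            if "W1" in trainable:
                params["W1"] = [[w - lr * g for w, g in zip(row, g_row)]
                                for row, g_row in zip(params["W1"], grad_w1)]
    return params


# Hypothetical "adult" model (assumed pretrained; values are made up).
adult_params = {"W1": [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]],
                "W2": [1.0, -0.5, 0.3]}

# Synthetic "child" data: same hidden layer, shifted output layer (assumption).
child_truth = {"W1": adult_params["W1"], "W2": [0.6, -0.9, 0.7]}
grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
child_data = [([a, b], forward(child_truth, [a, b])[1])
              for a in grid for b in grid]

if __name__ == "__main__":
    print("adult model on child data:",
          mean_squared_error(adult_params, child_data))
    for layers in ({"W2"}, {"W1"}, {"W1", "W2"}):
        adapted = adapt(adult_params, child_data, trainable=layers)
        print("adapted", sorted(layers), "->",
              mean_squared_error(adapted, child_data))
```

Varying the `trainable` set and the size of `child_data` mimics, in miniature, two of the factors the abstract lists: the number of adapted parameters and the amount of adaptation data.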