Qu Leyuan, Weber Cornelius, Wermter Stefan
Knowledge Technology, Department of Informatics, University of Hamburg, Hamburg, Germany; Department of Artificial Intelligence, Zhejiang Laboratory, Hangzhou, China.
Knowledge Technology, Department of Informatics, University of Hamburg, Hamburg, Germany.
Neural Netw. 2023 Apr;161:494-504. doi: 10.1016/j.neunet.2023.01.027. Epub 2023 Feb 10.
Due to the dynamic nature of human language, automatic speech recognition (ASR) systems need to continuously acquire new vocabulary. Out-of-vocabulary (OOV) words, such as trending words and new named entities, pose problems for modern ASR systems, which require long training times to adapt their large numbers of parameters. In contrast to most previous research, which focuses on language-model post-processing, we tackle this problem at an earlier processing level and eliminate the bias in acoustic modeling so that OOV words can be recognized acoustically. We propose generating OOV words with text-to-speech systems and rescaling losses to encourage neural networks to pay more attention to OOV words. Specifically, when fine-tuning a previously trained model on synthetic audio, we either enlarge the classification loss of utterances that contain OOV words (sentence-level) or rescale the gradient used for back-propagation for OOV words (word-level). To overcome catastrophic forgetting, we also explore combining loss rescaling with model regularization, i.e., L2 regularization and elastic weight consolidation (EWC). Compared with previous methods that simply fine-tune on synthetic audio with EWC, experimental results on the LibriSpeech benchmark show that our proposed loss rescaling approach achieves a significant improvement in recall rate with only a slight decrease in word error rate. Moreover, word-level rescaling is more stable than utterance-level rescaling and leads to higher recall and precision on OOV word recognition. Furthermore, our proposed combination of loss rescaling and weight consolidation supports continual learning of an ASR system.
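The abstract describes the method only at a high level; the following is a minimal PyTorch sketch of the two rescaling variants and the EWC penalty it mentions, under stated assumptions. The function names, the scale factor oov_scale, and the Fisher-information bookkeeping are illustrative choices, not the authors' released implementation.

```python
import torch

def utterance_level_rescaled_loss(per_utt_loss, contains_oov, oov_scale=2.0):
    # per_utt_loss: [batch] per-utterance losses; contains_oov: [batch] bool mask.
    # Utterances containing OOV words contribute oov_scale times more to the total loss
    # (the "sentence-level" rescaling described in the abstract).
    weights = torch.where(contains_oov,
                          torch.full_like(per_utt_loss, oov_scale),
                          torch.ones_like(per_utt_loss))
    return (weights * per_utt_loss).mean()

def word_level_gradient_rescale(logits, oov_mask, oov_scale=2.0):
    # Straight-through scaling: the forward value of `logits` is unchanged,
    # but gradients flowing back through positions aligned to OOV words
    # (oov_mask) are multiplied by oov_scale ("word-level" rescaling).
    scale = torch.ones_like(logits).masked_fill(oov_mask, oov_scale)
    return logits * scale - (logits * (scale - 1.0)).detach()

def ewc_penalty(model, fisher, old_params, lam=1.0):
    # Elastic weight consolidation: penalize drift of important parameters
    # (importance estimated by the Fisher information) away from the values
    # learned before fine-tuning on synthetic audio.
    penalty = 0.0
    for name, p in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return lam * penalty
```

In this sketch, the rescaled classification loss and the EWC term would simply be summed to form the fine-tuning objective; how the OOV mask is obtained (e.g., from forced alignments of the synthetic audio) is an assumption and is not specified in the abstract.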