Corporate Laboratory of Human-Machine Interaction Technologies, Information Technologies and Programming Faculty, School of Translational Information Technologies, ITMO University, 196084 Saint-Petersburg, Russia.
STC-Innovations Ltd., 194044 Saint-Petersburg, Russia.
Sensors (Basel). 2021 Apr 28;21(9):3063. doi: 10.3390/s21093063.
With the rapid development of speech assistants, adapting server-oriented automatic speech recognition (ASR) solutions to run directly on the device has become crucial. For on-device speech recognition, researchers and industry prefer end-to-end ASR systems because they can be made resource-efficient while maintaining higher quality than hybrid systems. However, building end-to-end models requires a significant amount of speech data. Personalization, which mainly means handling out-of-vocabulary (OOV) words, is another challenging task associated with speech assistants. In this work, we consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate, represented by the Babel Turkish and Babel Georgian tasks. We propose a method of dynamic acoustic unit augmentation based on the Byte Pair Encoding with dropout (BPE-dropout) technique. The method tokenizes utterances non-deterministically to extend the tokens' contexts and to regularize their distribution, which helps the model recognize unseen words. It also reduces the need to search for an optimal subword vocabulary size. The technique provides a steady improvement in both regular and personalized (OOV-oriented) speech recognition tasks (at least 6% relative in word error rate (WER) and 25% relative in F-score) at no additional computational cost. Owing to the use of BPE-dropout, our monolingual Turkish Conformer achieved a competitive result of 22.2% character error rate (CER) and 38.9% WER, which is close to the best published multilingual system.
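The non-deterministic tokenization described above can be illustrated with a minimal sketch of BPE-dropout segmentation for a single word. This is not the authors' implementation: the function name `bpe_dropout_encode`, the toy merge table `MERGES`, and the example word are hypothetical and chosen only for illustration. The sketch assumes the usual formulation of BPE-dropout, in which every applicable merge is skipped with probability `p` at each step and segmentation stops once no merge survives.

```python
import random

# Hypothetical toy merge table, highest priority first (not taken from the paper).
MERGES = [("e", "v"), ("l", "e"), ("le", "r"), ("ev", "ler")]


def bpe_dropout_encode(word, merges, p=0.1, rng=None):
    """Segment `word` with BPE, skipping each candidate merge with probability p."""
    rng = rng or random.Random()
    ranks = {pair: i for i, pair in enumerate(merges)}
    tokens = list(word)  # start from single characters
    while True:
        # Collect applicable merges as (priority, position); each one is
        # dropped independently with probability p (the only change vs. plain BPE).
        candidates = [
            (ranks[(a, b)], i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
            if (a, b) in ranks and rng.random() >= p
        ]
        if not candidates:
            return tokens
        # Apply the surviving merge with the highest priority (lowest rank).
        _, i = min(candidates)
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]


# p=0 reproduces deterministic BPE; p=1 falls back to characters.
full = bpe_dropout_encode("evler", MERGES, p=0.0)
chars = bpe_dropout_encode("evler", MERGES, p=1.0)
# Intermediate p yields varying segmentations of the same word across calls.
samples = [bpe_dropout_encode("evler", MERGES, p=0.3, rng=random.Random(s))
           for s in range(5)]
```

Because the segmentation is resampled every time an utterance is seen, the same transcript is presented to the model under different subword splits during training, which is the augmentation effect the abstract refers to. At inference time one would typically use `p=0`, that is, ordinary deterministic BPE.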