A study of a Transformer-based end-to-end speech recognition system for the Kazakh language.

Affiliations

Institute of Information and Computational Technologies CS MES RK, Almaty, Kazakhstan.

Satbayev University, Almaty, Kazakhstan.

Publication information

Sci Rep. 2022 May 18;12(1):8337. doi: 10.1038/s41598-022-12260-y.

DOI:10.1038/s41598-022-12260-y

PMID:35585130

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9117202/

Abstract

The Transformer model, which supports parallel computation and relies on self-attention, is now widely used in speech recognition. Its main advantages are fast training and the absence of the sequential processing that recurrent neural networks require. In this work, Transformer models and an end-to-end model based on connectionist temporal classification (CTC) were considered for building an automatic speech recognition system for Kazakh. Kazakh is an agglutinative language with limited data available for building speech recognition systems. Some studies have shown that the Transformer model improves system performance for low-resource languages. Our experiments showed that the joint use of Transformer and CTC models improved the performance of the Kazakh speech recognition system, and with an integrated language model it achieved its best character error rate of 3.7% on a clean dataset.
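
The character error rate reported in the abstract is conventionally defined as the character-level edit distance between the recognized text and the reference transcript, divided by the reference length. The following is a minimal sketch of that conventional definition, not the authors' evaluation code; the function names `edit_distance` and `character_error_rate` and the sample word pair are illustrative assumptions.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance counted in characters (insert, delete, substitute)."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,  # reference character missing from hypothesis
                    current[j - 1] + 1,  # extra character in hypothesis
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the number of reference characters."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Illustrative pair (not from the paper): one substituted character
    # out of five reference characters.
    print(character_error_rate("сәлем", "салем"))
```

Under this definition, a character error rate of 3.7% corresponds to roughly 3.7 character edits per 100 reference characters. Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.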
