Development of Language Models for Continuous Uzbek Speech Recognition System.

Affiliations

Department of Computer Engineering, Gachon University, Sujeong-gu, Seongnam-si 13120, Republic of Korea.

Department of Information Technologies, Samarkand Branch of Tashkent University of Information Technologies Named after Muhammad al-Khwarizmi, Tashkent 140100, Uzbekistan.

Publication information

Sensors (Basel). 2023 Jan 19;23(3):1145. doi: 10.3390/s23031145.

DOI:10.3390/s23031145

PMID:36772184

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9919949/

Abstract

Large-vocabulary automatic speech recognition systems and other natural language processing applications cannot operate without a language model. Most studies on pre-trained language models have focused on widely used languages such as English, Chinese, and various European languages, and there is no publicly available Uzbek speech dataset. Language models for low-resource languages therefore need to be studied and created. The objective of this study is to address this limitation by developing a language model for Uzbek, a low-resource language, and by understanding its linguistic phenomena. We propose an Uzbek language model named UzLM, developed by examining the performance of statistical and neural-network-based language models that account for the unique features of the Uzbek language. Our Uzbek-specific linguistic representation allows us to construct a more robust UzLM, utilizing 80 million words from various sources while using the same number of training words as previous studies, or fewer. Roughly 68,000 distinct words and 15 million sentences were collected to create this corpus. Experimental results on continuous Uzbek speech recognition show that, compared with manual encoding, the use of neural-network-based language models reduced the character error rate to 5.26%.
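For readers unfamiliar with the metric, character error rate (CER) is conventionally defined as the Levenshtein edit distance between the recognized text and the reference transcript, divided by the number of characters in the reference. The abstract does not describe the authors' scoring procedure, so the following is only a minimal sketch of that conventional definition; the function names `edit_distance` and `character_error_rate` and the sample strings are illustrative assumptions, not taken from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning ref into hyp."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Conventional CER: edit distance normalized by reference length.
    (Assumed definition; the paper's exact scoring is not given here.)"""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Illustrative strings only: one substituted character out of five.
    print(character_error_rate("kitob", "kitab"))
```

Under this definition, a CER of 5.26% corresponds to roughly one character edit for every 19 reference characters.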
