Satbayev University, Almaty, Kazakhstan.
Narxoz University, Almaty, Kazakhstan.
Sci Rep. 2024 Jun 15;14(1):13835. doi: 10.1038/s41598-024-64848-1.
To obtain a reliable and accurate automatic speech recognition (ASR) machine learning model, sufficient transcribed audio data is needed for training. Many of the world's languages, especially the agglutinative languages of the Turkic family, lack this type of data. Many studies have applied different approaches to obtain better models for low-resource languages; the most popular approaches are multilingual training and transfer learning. In this study, we combined five agglutinative languages of the Turkic family (Kazakh, Bashkir, Kyrgyz, Sakha, and Tatar) for multilingual training using connectionist temporal classification and an attention mechanism, together with a language model, because these languages share cognate words, sentence formation rules, and a common Cyrillic alphabet. Data from the open-source Common Voice database was used to make the experiments reproducible. The experiments showed that multilingual training improved ASR performance for all languages in the experiment except Bashkir. A dramatic result was achieved for Kyrgyz: the word error rate decreased to nearly one-fifth of its original value and the character error rate to one-fourth, which shows that this approach can be helpful for critically low-resource languages.
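The abstract reports results in terms of word error rate (WER) and character error rate (CER). For reference, both metrics are Levenshtein edit distances normalized by the length of the reference; a minimal stdlib-only sketch (illustrative, not the evaluation code used in the study) is:

```python
# Minimal WER/CER computation via Levenshtein distance (stdlib only).
# Illustrative sketch; not the authors' evaluation pipeline.

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences
    (insertions, deletions, substitutions all cost 1)."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))  # distances for the empty ref prefix
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def wer(reference, hypothesis):
    """Word error rate: edit distance over word tokens,
    normalized by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: edit distance over characters,
    normalized by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

# One substituted word out of three -> WER of about 0.33.
print(wer("менің атым Айдос", "менің атом Айдос"))
print(cer("менің атым Айдос", "менің атом Айдос"))
```

A "decrease to one-fifth" of a baseline WER thus means, for example, a drop from 50% to roughly 10% on the same test set.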