
A Study of Speech Recognition for Kazakh Based on Unsupervised Pre-Training.

Affiliations

Xinjiang Multilingual Information Technology Laboratory, Urumqi 830017, China.

College of Information Science and Engineering, Xinjiang University, Urumqi 830017, China.

Publication

Sensors (Basel). 2023 Jan 12;23(2):870. doi: 10.3390/s23020870.

DOI: 10.3390/s23020870
PMID: 36679666
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9863384/
Abstract

Building a good speech recognition system usually requires a large amount of paired data, which poses a major challenge for low-resource languages such as Kazakh. In recent years, unsupervised pre-training has achieved good performance in low-resource speech recognition, but it has rarely been applied to Kazakh and other Central and West Asian languages. In this paper, wav2vec 2.0 is improved by integrating a factorized TDNN layer to better preserve the relationship between the speech signal and the time steps before and after quantization; the resulting model is called wav2vec-F. An unsupervised pre-training strategy is used to learn latent speech representations from a large amount of unlabeled audio and is applied to the cross-lingual ASR task, optimized with a noise-contrastive binary classification task. Speech synthesis is also used to boost recognition performance. Experiments show that wav2vec-F can effectively exploit unlabeled data from non-target languages, and that multilingual pre-training clearly outperforms monolingual pre-training. Data augmentation with synthesized speech brings substantial gains. Compared with the baseline model, the word error rate on LibriSpeech's test-clean set drops by an average of 1.9%. On the Kazakh KSC test set, pre-training on Kazakh alone reduces the word error rate by 3.8%. With only 10 h of labeled data, multilingual pre-training combined with a small amount of TTS-synthesized Kazakh speech achieves a word error rate of 8.6% on the KSC test set, comparable to earlier end-to-end models trained with 30 times more labeled data.
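The noise-contrastive objective mentioned in the abstract can be illustrated with a minimal numpy sketch (not the authors' code): the model's context vector must identify the true quantized latent among a set of distractors, as in wav2vec 2.0's InfoNCE-style loss. The vector dimension, temperature value, and number of distractors below are illustrative, not taken from the paper.

```python
import numpy as np

def contrastive_loss(context, positive, distractors, temperature=0.1):
    """InfoNCE-style contrastive loss: -log softmax similarity of the
    context vector to the true quantized target among distractors."""
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Scaled cosine similarity to the positive target and each distractor
    sims = np.array([cosine(context, positive)] +
                    [cosine(context, d) for d in distractors]) / temperature
    sims -= sims.max()                         # numerical stability
    probs = np.exp(sims) / np.exp(sims).sum()  # softmax over candidates
    return -np.log(probs[0])                   # -log p(positive)

rng = np.random.default_rng(0)
c = rng.standard_normal(16)                 # context network output
pos = c + 0.05 * rng.standard_normal(16)    # true target, close to context
neg = [rng.standard_normal(16) for _ in range(5)]  # sampled distractors
loss = contrastive_loss(c, pos, neg)
```

When the positive target aligns with the context vector, the loss is near zero; swapping a random distractor into the positive slot drives it up, which is the signal pre-training uses to shape the representations.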


Figures
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a94a/9863384/56d86db0bc52/sensors-23-00870-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a94a/9863384/c170e66b344b/sensors-23-00870-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a94a/9863384/0796816b43f1/sensors-23-00870-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a94a/9863384/96f1cd09ff4f/sensors-23-00870-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a94a/9863384/bd6130396f03/sensors-23-00870-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a94a/9863384/fe278eafd4d3/sensors-23-00870-g006.jpg

Similar articles

1. A Study of Speech Recognition for Kazakh Based on Unsupervised Pre-Training.
Sensors (Basel). 2023 Jan 12;23(2):870. doi: 10.3390/s23020870.
2. A study of transformer-based end-to-end speech recognition system for Kazakh language.
Sci Rep. 2022 May 18;12(1):8337. doi: 10.1038/s41598-022-12260-y.
3. Improving Hybrid CTC/Attention Architecture for Agglutinative Language Speech Recognition.
Sensors (Basel). 2022 Sep 27;22(19):7319. doi: 10.3390/s22197319.
4. Multilingual end-to-end ASR for low-resource Turkic languages with common alphabets.
Sci Rep. 2024 Jun 15;14(1):13835. doi: 10.1038/s41598-024-64848-1.
5. Using Automatic Speech Recognition to Assess Thai Speech Language Fluency in the Montreal Cognitive Assessment (MoCA).
Sensors (Basel). 2022 Feb 17;22(4):1583. doi: 10.3390/s22041583.
6. Development of Language Models for Continuous Uzbek Speech Recognition System.
Sensors (Basel). 2023 Jan 19;23(3):1145. doi: 10.3390/s23031145.
7. Two-Step Joint Optimization with Auxiliary Loss Function for Noise-Robust Speech Recognition.
Sensors (Basel). 2022 Jul 19;22(14):5381. doi: 10.3390/s22145381.
8. Non-native acoustic modeling for mispronunciation verification based on language adversarial representation learning.
Neural Netw. 2021 Oct;142:597-607. doi: 10.1016/j.neunet.2021.07.017. Epub 2021 Jul 17.
9. Incorporating Noise Robustness in Speech Command Recognition by Noise Augmentation of Training Data.
Sensors (Basel). 2020 Apr 19;20(8):2326. doi: 10.3390/s20082326.
10. Automatic Speech Recognition Method Based on Deep Learning Approaches for Uzbek Language.
Sensors (Basel). 2022 May 12;22(10):3683. doi: 10.3390/s22103683.

Cited by

1. Continuous Sign Language Recognition and Its Translation into Intonation-Colored Speech.
Sensors (Basel). 2023 Jul 13;23(14):6383. doi: 10.3390/s23146383.
