
A Study of Speech Recognition for Kazakh Based on Unsupervised Pre-Training.

Affiliations

Xinjiang Multilingual Information Technology Laboratory, Urumqi 830017, China.

College of Information Science and Engineering, Xinjiang University, Urumqi 830017, China.

Publication

Sensors (Basel). 2023 Jan 12;23(2):870. doi: 10.3390/s23020870.

DOI: 10.3390/s23020870
PMID: 36679666
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9863384/
Abstract

Building a good speech recognition system usually requires a large amount of paired data, which poses a major challenge for low-resource languages such as Kazakh. In recent years, unsupervised pre-training has achieved good performance in low-resource speech recognition, but it has rarely been applied to Kazakh and other Central and West Asian languages. In this paper, wav2vec 2.0 is improved by integrating a factorized TDNN layer to better preserve the relationship between the speech signal and the time steps before and after quantization; the resulting model is called wav2vec-F. An unsupervised pre-training strategy is used to learn latent speech representations from a large amount of unlabeled audio and is applied to the cross-lingual ASR task, optimized with a noise-contrastive binary classification task. Speech synthesis is also used to boost recognition performance. Experiments show that wav2vec-F can effectively exploit unlabeled data from non-target languages, and that multilingual pre-training clearly outperforms monolingual pre-training. Data augmentation with synthesized speech brings substantial gains. Compared with the baseline model, the word error rate on LibriSpeech's test-clean set drops by an average of 1.9%. On the Kazakh KSC test set, pre-training on Kazakh alone reduces the word error rate by 3.8%. With only 10 h of labeled data, multilingual pre-training combined with a small amount of TTS-synthesized Kazakh speech achieves a word error rate of 8.6% on the KSC test set, comparable to earlier end-to-end models trained with 30 times more labeled data.
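The noise-contrastive objective mentioned in the abstract can be illustrated with a minimal numpy sketch (not the authors' code): the model's context vector must identify the true quantized latent among a set of distractors, as in wav2vec 2.0's InfoNCE-style loss. The vector dimension, temperature value, and number of distractors below are illustrative, not taken from the paper.

```python
import numpy as np

def contrastive_loss(context, positive, distractors, temperature=0.1):
    """InfoNCE-style contrastive loss: -log softmax similarity of the
    context vector to the true quantized target among distractors."""
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Scaled cosine similarity to the positive target and each distractor
    sims = np.array([cosine(context, positive)] +
                    [cosine(context, d) for d in distractors]) / temperature
    sims -= sims.max()                         # numerical stability
    probs = np.exp(sims) / np.exp(sims).sum()  # softmax over candidates
    return -np.log(probs[0])                   # -log p(positive)

rng = np.random.default_rng(0)
c = rng.standard_normal(16)                 # context network output
pos = c + 0.05 * rng.standard_normal(16)    # true target, close to context
neg = [rng.standard_normal(16) for _ in range(5)]  # sampled distractors
loss = contrastive_loss(c, pos, neg)
```

When the positive target aligns with the context vector, the loss is near zero; swapping a random distractor into the positive slot drives it up, which is the signal pre-training uses to shape the representations.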


Figures
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a94a/9863384/56d86db0bc52/sensors-23-00870-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a94a/9863384/c170e66b344b/sensors-23-00870-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a94a/9863384/0796816b43f1/sensors-23-00870-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a94a/9863384/96f1cd09ff4f/sensors-23-00870-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a94a/9863384/bd6130396f03/sensors-23-00870-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a94a/9863384/fe278eafd4d3/sensors-23-00870-g006.jpg

Similar articles

1. A Study of Speech Recognition for Kazakh Based on Unsupervised Pre-Training.
Sensors (Basel). 2023 Jan 12;23(2):870. doi: 10.3390/s23020870.
2. A study of transformer-based end-to-end speech recognition system for Kazakh language.
Sci Rep. 2022 May 18;12(1):8337. doi: 10.1038/s41598-022-12260-y.
3. Improving Hybrid CTC/Attention Architecture for Agglutinative Language Speech Recognition.
Sensors (Basel). 2022 Sep 27;22(19):7319. doi: 10.3390/s22197319.
4. Multilingual end-to-end ASR for low-resource Turkic languages with common alphabets.
Sci Rep. 2024 Jun 15;14(1):13835. doi: 10.1038/s41598-024-64848-1.
5. Using Automatic Speech Recognition to Assess Thai Speech Language Fluency in the Montreal Cognitive Assessment (MoCA).
Sensors (Basel). 2022 Feb 17;22(4):1583. doi: 10.3390/s22041583.
6. Development of Language Models for Continuous Uzbek Speech Recognition System.
Sensors (Basel). 2023 Jan 19;23(3):1145. doi: 10.3390/s23031145.
7. Two-Step Joint Optimization with Auxiliary Loss Function for Noise-Robust Speech Recognition.
Sensors (Basel). 2022 Jul 19;22(14):5381. doi: 10.3390/s22145381.
8. Non-native acoustic modeling for mispronunciation verification based on language adversarial representation learning.
Neural Netw. 2021 Oct;142:597-607. doi: 10.1016/j.neunet.2021.07.017. Epub 2021 Jul 17.
9. Incorporating Noise Robustness in Speech Command Recognition by Noise Augmentation of Training Data.
Sensors (Basel). 2020 Apr 19;20(8):2326. doi: 10.3390/s20082326.
10. Automatic Speech Recognition Method Based on Deep Learning Approaches for Uzbek Language.
Sensors (Basel). 2022 May 12;22(10):3683. doi: 10.3390/s22103683.

Cited by

1. Continuous Sign Language Recognition and Its Translation into Intonation-Colored Speech.
Sensors (Basel). 2023 Jul 13;23(14):6383. doi: 10.3390/s23146383.
