

Dynamic Acoustic Unit Augmentation with BPE-Dropout for Low-Resource End-to-End Speech Recognition.

Affiliations

Corporate Laboratory of Human-Machine Interaction Technologies, Information Technologies and Programming Faculty, School of Translational Information Technologies, ITMO University, 196084 Saint-Petersburg, Russia.

STC-Innovations Ltd., 194044 Saint-Petersburg, Russia.

Publication

Sensors (Basel). 2021 Apr 28;21(9):3063. doi: 10.3390/s21093063.

DOI: 10.3390/s21093063
PMID: 33924798
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8124527/
Abstract

With the rapid development of speech assistants, adapting server-intended automatic speech recognition (ASR) solutions to a direct device has become crucial. For on-device speech recognition tasks, researchers and industry prefer end-to-end ASR systems as they can be made resource-efficient while maintaining a higher quality compared to hybrid systems. However, building end-to-end models requires a significant amount of speech data. Personalization, which is mainly handling out-of-vocabulary (OOV) words, is another challenging task associated with speech assistants. In this work, we consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate, embodied in Babel Turkish and Babel Georgian tasks. We propose a method of dynamic acoustic unit augmentation based on the Byte Pair Encoding with dropout (BPE-dropout) technique. The method non-deterministically tokenizes utterances to extend the token's contexts and to regularize their distribution for the model's recognition of unseen words. It also reduces the need for optimal subword vocabulary size search. The technique provides a steady improvement in regular and personalized (OOV-oriented) speech recognition tasks (at least 6% relative word error rate (WER) and 25% relative F-score) at no additional computational cost. Owing to the BPE-dropout use, our monolingual Turkish Conformer has achieved a competitive result with 22.2% character error rate (CER) and 38.9% WER, which is close to the best published multilingual system.
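The core idea the abstract describes — tokenizing the same utterance non-deterministically so the model sees varied subword segmentations — can be sketched in a few lines. The following is a simplified illustration, not the authors' implementation: the merge table and the sample word are hypothetical, and at each merge step every eligible merge is dropped with probability `p`, so `p=0` recovers standard deterministic BPE and `p=1` degrades to character-level tokens.

```python
import random

def bpe_dropout_tokenize(word, merges, p, rng=None):
    """Tokenize `word` with BPE, randomly skipping each eligible merge
    with probability `p` at every step (BPE-dropout).
    p=0 -> standard BPE; p=1 -> character-level tokens."""
    rng = rng or random.Random()
    tokens = list(word)
    while True:
        # Collect adjacent pairs that have a merge rule; drop each with prob p.
        candidates = [
            (merges[(a, b)], i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
            if (a, b) in merges and rng.random() >= p
        ]
        if not candidates:
            return tokens
        # Apply the surviving merge with the best (lowest) rank.
        _, i = min(candidates)
        tokens = tokens[:i] + [tokens[i] + tokens[i + 1]] + tokens[i + 2:]

# Hypothetical merge table (rank = learned priority; lower merges first).
merges = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2, ("low", "er"): 3}

print(bpe_dropout_tokenize("lower", merges, p=0.0))  # ['lower']
print(bpe_dropout_tokenize("lower", merges, p=1.0))  # ['l', 'o', 'w', 'e', 'r']
print(bpe_dropout_tokenize("lower", merges, p=0.3, rng=random.Random(0)))
```

Because dropped merges leave rarer subwords in the training stream, the model's token distribution is regularized and unseen (OOV) words decompose more gracefully at inference time, which is the effect the paper measures.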


Figures (g001–g007):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e3b3/8124527/7302a1d9270c/sensors-21-03063-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e3b3/8124527/a7797635894b/sensors-21-03063-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e3b3/8124527/500692745c76/sensors-21-03063-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e3b3/8124527/eada76e18a90/sensors-21-03063-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e3b3/8124527/cada615e2759/sensors-21-03063-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e3b3/8124527/d001ce21f90f/sensors-21-03063-g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e3b3/8124527/7d9d1c524ecd/sensors-21-03063-g007.jpg

Similar Articles

1. Dynamic Acoustic Unit Augmentation with BPE-Dropout for Low-Resource End-to-End Speech Recognition.
Sensors (Basel). 2021 Apr 28;21(9):3063. doi: 10.3390/s21093063.
2. Emphasizing unseen words: New vocabulary acquisition for end-to-end speech recognition.
Neural Netw. 2023 Apr;161:494-504. doi: 10.1016/j.neunet.2023.01.027. Epub 2023 Feb 10.
3. Improving Hybrid CTC/Attention Architecture for Agglutinative Language Speech Recognition.
Sensors (Basel). 2022 Sep 27;22(19):7319. doi: 10.3390/s22197319.
4. Two-Step Joint Optimization with Auxiliary Loss Function for Noise-Robust Speech Recognition.
Sensors (Basel). 2022 Jul 19;22(14):5381. doi: 10.3390/s22145381.
5. A comparison of automatic and human speech recognition in null grammar.
J Acoust Soc Am. 2012 Mar;131(3):EL256-61. doi: 10.1121/1.3684744.
6. Automatic Speech Recognition Performance Improvement for Mandarin Based on Optimizing Gain Control Strategy.
Sensors (Basel). 2022 Apr 15;22(8):3027. doi: 10.3390/s22083027.
7. The development of an automatic speech recognition model using interview data from long-term care for older adults.
J Am Med Inform Assoc. 2023 Feb 16;30(3):411-417. doi: 10.1093/jamia/ocac241.
8. Using Hybrid HMM/DNN Embedding Extractor Models in Computational Paralinguistic Tasks.
Sensors (Basel). 2023 May 30;23(11):5208. doi: 10.3390/s23115208.
9. Using Automatic Speech Recognition to Assess Thai Speech Language Fluency in the Montreal Cognitive Assessment (MoCA).
Sensors (Basel). 2022 Feb 17;22(4):1583. doi: 10.3390/s22041583.
10. Multiexpert automatic speech recognition using acoustic and myoelectric signals.
IEEE Trans Biomed Eng. 2006 Apr;53(4):676-85. doi: 10.1109/TBME.2006.870224.

Cited By

1. Audio Augmentation for Non-Native Children's Speech Recognition through Discriminative Learning.
Entropy (Basel). 2022 Oct 19;24(10):1490. doi: 10.3390/e24101490.
2. Deep Learning Framework for Controlling Work Sequence in Collaborative Human-Robot Assembly Processes.
Sensors (Basel). 2023 Jan 3;23(1):553. doi: 10.3390/s23010553.