

Multi-Stage Audio-Visual Fusion for Dysarthric Speech Recognition With Pre-Trained Models.

Publication Information

IEEE Trans Neural Syst Rehabil Eng. 2023;31:1912-1921. doi: 10.1109/TNSRE.2023.3262001.

DOI: 10.1109/TNSRE.2023.3262001
PMID: 37030692
Abstract

Dysarthric speech recognition helps speakers with dysarthria to communicate more effectively. However, dysarthric speech is difficult to collect, so machine learning models cannot be trained sufficiently on it. To further improve the accuracy of dysarthric speech recognition, we proposed a Multi-stage AV-HuBERT (MAV-HuBERT) framework that fuses the visual and acoustic information of dysarthric speech. In the first stage, we proposed using a convolutional neural network model to encode motor information by incorporating all facial speech-function areas. This differs from the traditional audio-visual fusion approach, which is based solely on lip movement. In the second stage, we proposed using the AV-HuBERT framework to pre-train the recognition architecture that fuses the audio and visual information of the dysarthric speech; the knowledge gained by the pre-trained model is applied to mitigate overfitting. Experiments on UASpeech were designed to evaluate the proposed method. Compared with the baseline, the best word error rate (WER) of our method was reduced by 13.5% on moderate dysarthric speech. For mild dysarthric speech, our method achieves the best result, with a WER of 6.05%. Even for extremely severe dysarthric speech, the WER of our method reaches 63.98%, which is 2.72% and 4.02% lower than the WERs of wav2vec and HuBERT, respectively. The proposed method can effectively further reduce the WER of dysarthric speech.
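All results in the abstract are reported as word error rate (WER). As a quick illustration of the metric (this is not the authors' evaluation code), WER is the word-level Levenshtein edit distance between reference and hypothesis transcripts, normalized by the reference length:

```python
def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match/substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the quick brown fox", "the quick brown dog"))  # 0.25
```

A WER of 6.05% on mild dysarthric speech thus means roughly 6 word errors per 100 reference words; note a WER above 100% is possible when the hypothesis inserts many extra words.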


Similar Articles

1. Multi-Stage Audio-Visual Fusion for Dysarthric Speech Recognition With Pre-Trained Models.
IEEE Trans Neural Syst Rehabil Eng. 2023;31:1912-1921. doi: 10.1109/TNSRE.2023.3262001.
2. A multi-views multi-learners approach towards dysarthric speech recognition using multi-nets artificial neural networks.
IEEE Trans Neural Syst Rehabil Eng. 2014 Sep;22(5):1053-63. doi: 10.1109/TNSRE.2014.2309336. Epub 2014 Mar 11.
3. Dysarthric Speech Enhancement Based on Convolution Neural Network.
Annu Int Conf IEEE Eng Med Biol Soc. 2022 Jul;2022:60-64. doi: 10.1109/EMBC48229.2022.9871531.
4. Dysarthric Speech Transformer: A Sequence-to-Sequence Dysarthric Speech Recognition System.
IEEE Trans Neural Syst Rehabil Eng. 2023;31:3407-3416. doi: 10.1109/TNSRE.2023.3307020. Epub 2023 Aug 29.
5. Estimation of phoneme-specific HMM topologies for the automatic recognition of dysarthric speech.
Comput Math Methods Med. 2013;2013:297860. doi: 10.1155/2013/297860. Epub 2013 Oct 8.
6. Improving Acoustic Models in TORGO Dysarthric Speech Database.
IEEE Trans Neural Syst Rehabil Eng. 2018 Mar;26(3):637-645. doi: 10.1109/TNSRE.2018.2802914.
7. Speech Vision: An End-to-End Deep Learning-Based Dysarthric Automatic Speech Recognition System.
IEEE Trans Neural Syst Rehabil Eng. 2021;29:852-861. doi: 10.1109/TNSRE.2021.3076778. Epub 2021 May 7.
8. Intelligibility of dysarthric speech: perceptions of speakers and listeners.
Int J Lang Commun Disord. 2008 Nov-Dec;43(6):633-48. doi: 10.1080/13682820801887117.
9. Vocal tract representation in the recognition of cerebral palsied speech.
J Speech Lang Hear Res. 2012 Aug;55(4):1190-207. doi: 10.1044/1092-4388(2011/11-0223). Epub 2012 Jan 23.
10. Improving Dysarthric Speech Segmentation With Emulated and Synthetic Augmentation.
IEEE J Transl Eng Health Med. 2024 Mar 11;12:382-389. doi: 10.1109/JTEHM.2024.3375323. eCollection 2024.

Cited By

1. A novel Swin transformer based framework for speech recognition for dysarthria.
Sci Rep. 2025 Jun 16;15(1):20070. doi: 10.1038/s41598-025-02042-7.