Speech Vision: An End-to-End Deep Learning-Based Dysarthric Automatic Speech Recognition System.

Publication information

IEEE Trans Neural Syst Rehabil Eng. 2021;29:852-861. doi: 10.1109/TNSRE.2021.3076778. Epub 2021 May 7.

Abstract

Dysarthria is a disorder that affects an individual's speech intelligibility due to the paralysis of muscles and organs involved in the articulation process. As the condition is often associated with physically debilitating disabilities, such individuals not only face communication problems; interactions with digital devices can also become a burden. For these individuals, automatic speech recognition (ASR) technologies can make a significant difference, as computing and portable digital devices can become an interaction medium that enables them to communicate with other people and with computers. However, ASR technologies have performed poorly in recognizing dysarthric speech, especially severe dysarthria, because of multiple challenges facing dysarthric ASR systems. We identified three sources of these challenges: the alternation and inaccuracy of dysarthric phonemes, the scarcity of dysarthric speech data, and imprecise phoneme labeling. This paper reports on our second dysarthria-specific ASR system, Speech Vision (SV), which tackles these challenges with a novel approach to dysarthric ASR: speech features are extracted visually, and SV then learns to see the shape of the words pronounced by dysarthric individuals. This visual acoustic modeling eliminates the phoneme-related challenges. To address data scarcity, SV adopts visual data augmentation techniques, generates synthetic dysarthric acoustic visuals, and leverages transfer learning. Benchmarked against the other state-of-the-art dysarthric ASR systems considered in this study, SV outperformed them, improving recognition accuracy for 67% of UA-Speech speakers, with the biggest improvements achieved for severe dysarthria.
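The abstract does not say which visual representation SV uses, so the following is only a minimal sketch of what "extracting speech features visually" could look like, assuming a log-magnitude spectrogram rendered as an 8-bit grayscale image. The function names `spectrogram_image` and `time_shift`, the frame and hop sizes, and the circular time shift used as an augmentation are all hypothetical choices made for illustration; they are not taken from the paper.

```python
import numpy as np


def spectrogram_image(signal, frame_len=256, hop=128):
    """Return a log-magnitude spectrogram as a uint8 image (frequency x time).

    Assumption: a plain short-time Fourier transform stands in for the
    paper's visual feature extraction, which the abstract does not specify.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size < frame_len:
        # Zero-pad very short inputs so at least one frame exists.
        signal = np.pad(signal, (0, frame_len - signal.size))
    n_frames = 1 + (signal.size - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    log_mag = np.log1p(magnitude).T  # rows are frequency bins, columns are frames
    span = log_mag.max() - log_mag.min()
    if span > 0:
        scaled = (log_mag - log_mag.min()) / span
    else:
        scaled = np.zeros_like(log_mag)
    return (255 * scaled).astype(np.uint8)


def time_shift(image, max_shift, rng):
    """Hypothetical visual augmentation: circular shift along the time axis."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(image, shift, axis=1)


# Synthetic one-second 440 Hz tone at 16 kHz, used in place of real speech.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
tone = np.sin(2 * np.pi * 440.0 * t)

img = spectrogram_image(tone)
aug = time_shift(img, max_shift=5, rng=rng)
print(img.shape, img.dtype)
```

An image like `img` could then be passed to an image classifier pretrained on another dataset, which is one common way to apply transfer learning when labeled data is scarce; whether SV does exactly this is not stated in the abstract.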

