Division of Robotics Engineering, Karunya Institute of Technology and Sciences, Coimbatore, Tamil Nadu, India.
Division of Biomedical Engineering, Karunya Institute of Technology and Sciences, Coimbatore, Tamil Nadu, India.
Sci Rep. 2024 Nov 27;14(1):29455. doi: 10.1038/s41598-024-80764-w.
Dysarthria, a motor speech disorder that impairs articulation and speech clarity, presents significant challenges for Automatic Speech Recognition (ASR) systems. This study proposes an approach to improve the accuracy of Dysarthric Speech Recognition (DSR). Its primary innovation is the SepFormer-Speech Enhancement Generative Adversarial Network (S-SEGAN), a generative adversarial network tailored for Dysarthric Speech Enhancement (DSE) and used as a front-end processing stage for DSR systems. S-SEGAN combines SEGAN's adversarial learning with SepFormer's speech separation capabilities, which yields significant performance improvements. Furthermore, a multistage transfer learning approach is employed to assess the DSR models for both word-level and sentence-level recognition. These DSR models are first trained on a large speech dataset (LibriSpeech) and then fine-tuned on dysarthric speech data (both isolated and augmented). Evaluations show that integrating DSE significantly improves DSR accuracy. Without DSE, the Dysarthric Speech (DS) baseline Transformer and Conformer models achieved Word Recognition Accuracy (WRA) of 68.60% and 69.87%, respectively. Adding a Hierarchical Attention Network (HAN) to the Transformer and Conformer architectures improved performance, with T-HAN achieving a WRA of 71.07% and C-HAN reaching 73%. With DSE + DSR on isolated words, the Transformer model achieves a WRA of 73.40% and the Conformer model 74.33%. The T-HAN and C-HAN models with DSE + DSR improve further, with WRAs of 75.73% and 76.87%, respectively. Word augmentation further boosts model performance, with the Transformer and Conformer models achieving WRAs of 76.47% and 79.20%, respectively. The T-HAN and C-HAN models with DSE + DSR and augmented words reach WRAs of 82.13% and 84.07%, respectively, with C-HAN achieving the highest performance among all proposed models.
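The reported figures can be compared side by side. The following is a minimal sketch that tabulates the WRA values quoted above and computes the absolute gain of each model over its baseline. The dictionary keys and the names `WRA`, `absolute_gains`, `GAINS`, and `BEST` are illustrative, not taken from the paper. It also assumes that the augmented-word figures for the plain Transformer and Conformer include DSE, which the abstract implies but states explicitly only for T-HAN and C-HAN.

```python
# WRA values (%) exactly as quoted in the abstract.
# Keys are illustrative labels:
#   "baseline" = DSR without DSE
#   "dse"      = DSE + DSR on isolated words
#   "dse_aug"  = DSE + DSR with augmented words (assumed to include DSE
#                for Transformer and Conformer as well)
WRA = {
    "Transformer": {"baseline": 68.60, "dse": 73.40, "dse_aug": 76.47},
    "Conformer":   {"baseline": 69.87, "dse": 74.33, "dse_aug": 79.20},
    "T-HAN":       {"baseline": 71.07, "dse": 75.73, "dse_aug": 82.13},
    "C-HAN":       {"baseline": 73.00, "dse": 76.87, "dse_aug": 84.07},
}


def absolute_gains(wra):
    """Return per-model gains in percentage points relative to the baseline."""
    return {
        model: {
            "dse_gain": round(v["dse"] - v["baseline"], 2),
            "total_gain": round(v["dse_aug"] - v["baseline"], 2),
        }
        for model, v in wra.items()
    }


GAINS = absolute_gains(WRA)

# Model with the highest WRA in the final (DSE + augmentation) configuration.
BEST = max(WRA, key=lambda m: WRA[m]["dse_aug"])

if __name__ == "__main__":
    for model, g in GAINS.items():
        print(f"{model:12s} DSE gain: {g['dse_gain']:5.2f}  total gain: {g['total_gain']:5.2f}")
    print("Best final model:", BEST)
```

Under these assumptions, DSE alone adds roughly 4 to 5 percentage points for each model, and the HAN variants gain about 11 points in total over their own baselines.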