Bhat Chitralekha, Strik Helmer
Centre for Language and Speech Technology (CLST), Radboud University Nijmegen, The Netherlands.
Centre for Language and Speech Technology (CLST), Radboud University Nijmegen, The Netherlands; Centre for Language Studies (CLS), Radboud University Nijmegen, The Netherlands; Donders Institute for Brain, Cognition and Behaviour, Radboud University Nijmegen, The Netherlands.
Comput Biol Med. 2025 May;189:109954. doi: 10.1016/j.compbiomed.2025.109954. Epub 2025 Mar 13.
Machine learning (ML) and deep neural networks (DNN) have greatly advanced automatic speech recognition (ASR). However, accurate ASR for dysarthric speech remains a serious challenge, and the dearth of usable data limits the application of ML and DNN techniques to dysarthric speech recognition. In the current research, we address this challenge using a novel two-stage data augmentation scheme, a combination of static and dynamic augmentation techniques designed by leveraging an understanding of the characteristics of dysarthric speech. In the first stage, comprising the static augmentations, we explore speaker-independent ASR trained on healthy speech modified by various perturbations, devoicing of consonants, and voice conversion. In the second stage, a modified SpecAugment algorithm tailored to dysarthric speech, termed Dysarthric SpecAugment, is applied; this variant likewise leverages the characteristics of dysarthric speech. The resulting acoustic model serves as pre-training for a speaker-dependent ASR subsequently trained on dysarthric speech. The objective of this work is to improve ASR performance for dysarthric speech using the two-stage data augmentation scheme. An end-to-end ASR with a Transformer acoustic model is used to evaluate the scheme on speech from the UA dysarthric speech corpus. We achieve an absolute improvement of 10.7% and a relative improvement of 29.2% in word error rate (WER) over a baseline with no augmentation, with a final WER of 25.9% for the speaker-dependent system.
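The paper's Dysarthric SpecAugment variant is not specified in this abstract, but the standard SpecAugment technique it modifies masks random frequency and time bands of a spectrogram during training. A minimal sketch of that baseline masking, assuming a (frequency bins x time frames) NumPy array; the function name and mask-width parameters are illustrative, not the authors' settings:

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, num_time_masks=2,
                 max_freq_width=8, max_time_width=20, rng=None):
    """SpecAugment-style masking: zero out random frequency bands and
    time spans of a spectrogram, returning a masked copy."""
    rng = np.random.default_rng() if rng is None else rng
    spec = spec.copy()
    n_freq, n_time = spec.shape
    # Frequency masking: zero a band of up to max_freq_width bins.
    for _ in range(num_freq_masks):
        width = int(rng.integers(0, max_freq_width + 1))
        start = int(rng.integers(0, n_freq - width + 1))
        spec[start:start + width, :] = 0.0
    # Time masking: zero a span of up to max_time_width frames.
    for _ in range(num_time_masks):
        width = int(rng.integers(0, max_time_width + 1))
        start = int(rng.integers(0, n_time - width + 1))
        spec[:, start:start + width] = 0.0
    return spec
```

Because the masks are redrawn on every call, each training epoch sees a different corruption of the same utterance, which is what makes this a dynamic (on-the-fly) augmentation in contrast to the static, precomputed modifications of stage one.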