Bhat Chitralekha, Strik Helmer
Centre for Language and Speech Technology (CLST), Radboud University Nijmegen, The Netherlands.
Centre for Language and Speech Technology (CLST), Radboud University Nijmegen, The Netherlands; Centre for Language Studies (CLS), Radboud University Nijmegen, The Netherlands; Donders Institute for Brain, Cognition and Behaviour, Radboud University Nijmegen, The Netherlands.
Comput Biol Med. 2025 May;189:109954. doi: 10.1016/j.compbiomed.2025.109954. Epub 2025 Mar 13.
Machine learning (ML) and deep neural networks (DNN) have greatly advanced automatic speech recognition (ASR). However, accurate ASR for dysarthric speech remains a serious challenge, and the dearth of usable data limits the application of ML and DNN techniques to dysarthric speech recognition. In the current research, we address this challenge using a novel two-stage data augmentation scheme, a combination of static and dynamic augmentation techniques designed by leveraging an understanding of the characteristics of dysarthric speech. In the first stage, comprising the static augmentations, we explore speaker-independent ASR trained on healthy speech modified by various perturbations, devoicing of consonants, and voice conversion. In the second stage, a modified SpecAugment algorithm tailored to dysarthric speech, termed Dysarthric SpecAugment, is applied; this variant likewise leverages the characteristics of dysarthric speech. The resulting acoustic model serves as pre-training for a speaker-dependent ASR subsequently trained on dysarthric speech. The objective of this work is to improve ASR performance for dysarthric speech using the two-stage data augmentation scheme. An end-to-end ASR with a Transformer acoustic model is used to evaluate the scheme on speech from the UA dysarthric speech corpus. We achieve an absolute improvement of 10.7% and a relative improvement of 29.2% in word error rate (WER) over a baseline with no augmentation, with a final WER of 25.9% for the speaker-dependent system.
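The paper's Dysarthric SpecAugment variant is not specified in this abstract, but the standard SpecAugment technique it modifies masks random frequency and time bands of a spectrogram during training. A minimal sketch of that baseline masking, assuming a (frequency bins x time frames) NumPy array; the function name and mask-width parameters are illustrative, not the authors' settings:

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, num_time_masks=2,
                 max_freq_width=8, max_time_width=20, rng=None):
    """SpecAugment-style masking: zero out random frequency bands and
    time spans of a spectrogram, returning a masked copy."""
    rng = np.random.default_rng() if rng is None else rng
    spec = spec.copy()
    n_freq, n_time = spec.shape
    # Frequency masking: zero a band of up to max_freq_width bins.
    for _ in range(num_freq_masks):
        width = int(rng.integers(0, max_freq_width + 1))
        start = int(rng.integers(0, n_freq - width + 1))
        spec[start:start + width, :] = 0.0
    # Time masking: zero a span of up to max_time_width frames.
    for _ in range(num_time_masks):
        width = int(rng.integers(0, max_time_width + 1))
        start = int(rng.integers(0, n_time - width + 1))
        spec[:, start:start + width] = 0.0
    return spec
```

Because the masks are redrawn on every call, each training epoch sees a different corruption of the same utterance, which is what makes this a dynamic (on-the-fly) augmentation in contrast to the static, precomputed modifications of stage one.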