IEEE Trans Neural Syst Rehabil Eng. 2023;31:1912-1921. doi: 10.1109/TNSRE.2023.3262001.
Dysarthric speech recognition helps speakers with dysarthria communicate more effectively. However, dysarthric speech is difficult to collect, so machine learning models cannot be trained on it sufficiently. To further improve the accuracy of dysarthric speech recognition, we propose a Multi-stage AV-HuBERT (MAV-HuBERT) framework that fuses the visual and acoustic information of dysarthric speech. In the first stage, we use a convolutional neural network model to encode motor information by incorporating all facial speech-function areas, rather than relying solely on lip movement as in traditional audio-visual fusion frameworks. In the second stage, we use the AV-HuBERT framework to pre-train the recognition architecture that fuses the audio and visual information of dysarthric speech; the knowledge gained by the pre-trained model is then applied to address the model's overfitting problem. Experiments on UASpeech were designed to evaluate the proposed method. Compared with the baseline, the best word error rate (WER) of our method was reduced by 13.5% on moderate dysarthric speech. For mild dysarthric speech, our method achieves the best result, with a WER of 6.05%. Even for extremely severe dysarthric speech, the WER of our method reaches 63.98%, which is 2.72% and 4.02% lower than the WERs of wav2vec and HuBERT, respectively. The proposed method can thus effectively further reduce the WER on dysarthric speech.
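To make the fusion idea concrete, the sketch below illustrates how per-frame visual features from several facial regions (not only the lips) could be encoded by a small CNN and concatenated with acoustic features before a Transformer encoder. This is a minimal illustrative sketch only, not the authors' implementation: the module names, input shapes, number of regions, and layer sizes are assumptions chosen for readability.

```python
# Illustrative sketch only: NOT the authors' MAV-HuBERT code. Shapes, region
# count, and layer sizes are hypothetical; the point is the multi-region CNN
# visual encoder fused with acoustic features before a Transformer encoder.
import torch
import torch.nn as nn


class RegionVisualEncoder(nn.Module):
    """Encodes a stack of facial-region crops (lips, jaw, cheeks, ...) per frame."""

    def __init__(self, num_regions: int = 4, feat_dim: int = 256):
        super().__init__()
        # Each frame provides `num_regions` grayscale crops, stacked as channels.
        self.cnn = nn.Sequential(
            nn.Conv2d(num_regions, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, time, regions, H, W)
        b, t, r, h, w = video.shape
        x = self.cnn(video.view(b * t, r, h, w)).flatten(1)  # (b*t, 64)
        return self.proj(x).view(b, t, -1)                   # (b, t, feat_dim)


class AudioVisualFusion(nn.Module):
    """Concatenates audio and visual frame features and runs a Transformer encoder."""

    def __init__(self, audio_dim: int = 80, feat_dim: int = 256):
        super().__init__()
        self.visual = RegionVisualEncoder(feat_dim=feat_dim)
        self.audio_proj = nn.Linear(audio_dim, feat_dim)
        layer = nn.TransformerEncoderLayer(d_model=2 * feat_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # audio: (batch, time, audio_dim); video: (batch, time, regions, H, W)
        fused = torch.cat([self.audio_proj(audio), self.visual(video)], dim=-1)
        return self.encoder(fused)  # (batch, time, 2 * feat_dim)


if __name__ == "__main__":
    model = AudioVisualFusion()
    audio = torch.randn(2, 50, 80)          # 50 frames of 80-dim filterbank features
    video = torch.randn(2, 50, 4, 64, 64)   # 4 facial-region crops per frame
    print(model(audio, video).shape)        # torch.Size([2, 50, 512])
```

In the paper's actual pipeline, such fused representations would feed an AV-HuBERT-style pre-training stage before fine-tuning on dysarthric speech; the sketch stops at the fusion step.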