Safieh Ali A, Abu Alhaol Ibrahim, Ghnemat Rawan
Data Science Department, King Hussein School of Computing Sciences, Princess Sumaya University for Technology, Amman, Jordan.
Front Robot AI. 2022 Dec 22;9:1090012. doi: 10.3389/frobt.2022.1090012. eCollection 2022.
Speech-to-text engines are in high demand for many applications and are an essential enabler of human-robot interaction. However, many languages, particularly Arabic dialects and other low-resource languages, suffer from a lack of labeled speech data. Self-supervised pre-training combined with self-training on noisy data has proven to be one of the most promising and feasible solutions. This article proposes an end-to-end, transformer-based model and an accompanying framework for low-resource languages. The framework incorporates customized audio-to-text processing algorithms to build a highly efficient Jordanian Arabic dialect speech-to-text system. It ingests data from many sources and makes ground truth from external sources usable by speeding up the manual annotation process. Training combines noisy student training with self-supervised learning, so unlabeled data are exploited in both the pre-training and post-training stages, and multiple types of data augmentation are incorporated. The proposed self-training approach outperforms the fine-tuned Wav2Vec model by a 5% reduction in word error rate. This work provides the research community with a Jordanian-spoken data set and an end-to-end approach to low-resource languages, built on pre-training, post-training, and the injection of noisy labeled and augmented data with minimal human intervention. It enables new applications in Arabic speech-to-text, such as question-answering systems and intelligent control systems, and adds human-like perception and hearing to intelligent robots.
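The abstract describes noisy-student self-training on top of a fine-tuned Wav2Vec model, evaluated by word error rate. The sketch below illustrates one such self-training round under stated assumptions: it uses the Hugging Face transformers Wav2Vec2 API and the jiwer package for WER; the checkpoint path, helper names, and data handling are illustrative placeholders and are not the authors' released code or model.

```python
# Minimal sketch of one noisy-student round for dialect ASR (illustrative only).
import torch
import jiwer
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Hypothetical fine-tuned teacher checkpoint (placeholder path, not the paper's model).
TEACHER = "path/to/jordanian-dialect-wav2vec2-teacher"

processor = Wav2Vec2Processor.from_pretrained(TEACHER)
teacher = Wav2Vec2ForCTC.from_pretrained(TEACHER)
teacher.eval()

def transcribe(model, waveform, sampling_rate=16_000):
    """Greedy CTC decoding of one 16 kHz mono utterance (1-D float array)."""
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    return processor.batch_decode(torch.argmax(logits, dim=-1))[0]

def pseudo_label(model, waveforms):
    """Step 1: label unlabeled dialect audio with the fine-tuned teacher."""
    return [(w, transcribe(model, w)) for w in waveforms]

def evaluate(model, waveforms, references):
    """Word error rate of a model against reference transcripts."""
    hypotheses = [transcribe(model, w) for w in waveforms]
    return jiwer.wer(references, hypotheses)

# Step 2 (training loop omitted): train a student on labeled plus pseudo-labeled
# data with noise/augmentation (SpecAugment-style masking is part of standard
# Wav2Vec2 fine-tuning), then compare evaluate(student, ...) against the
# fine-tuned teacher baseline to measure the word-error-rate reduction.
```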