Speech recognition datasets for low-resource Congolese languages.

Authors

Kimanuka Ussen, Maina Ciira Wa, Büyük Osman

Affiliations

Department of Electrical Engineering, Pan African University Institute for Basic Sciences, Technology and Innovation, Nairobi, Kenya.

Department of Electrical Engineering, Dedan Kimathi University of Technology, Nyeri, Kenya.

Publication information

Data Brief. 2023 Nov 10;52:109796. doi: 10.1016/j.dib.2023.109796. eCollection 2024 Feb.

DOI:10.1016/j.dib.2023.109796

PMID:38076471

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10700368/

Abstract

Large pre-trained Automatic Speech Recognition (ASR) models have shown improved performance in low-resource languages due to the increased availability of benchmark corpora and the advantages of transfer learning. However, only a limited number of languages possess ample resources to fully leverage transfer learning. In such contexts, benchmark corpora become crucial for advancing methods. In this article, we introduce two new benchmark corpora designed for low-resource languages spoken in the Democratic Republic of the Congo: the Lingala Read Speech Corpus, with 4 h of labelled audio, and the Congolese Speech Radio Corpus, which offers 741 h of unlabelled audio spanning four significant low-resource languages of the region. For the Lingala Read Speech Corpus, thirty-two distinct adult speakers were recorded, each with a unique context, in various settings and with different accents. The Congolese Speech Radio raw data were taken from the archive of a broadcast station and then passed through a designed curation process. Numerous strategies were used to pre-process the data during preparation. The datasets, which have been made freely accessible to all researchers, serve as a valuable resource for investigating and developing not only monolingual methods and approaches that employ linguistically distant languages but also multilingual approaches with linguistically similar languages. Using supervised and self-supervised learning, we developed the first benchmark speech recognition systems for Lingala and the first multilingual model tailored for four Congolese languages spoken by an aggregate population of 95 million. Two models were applied to the data: the first for supervised learning and the second for self-supervised pre-training.
