Speech recognition datasets for low-resource Congolese languages.

Author information

Kimanuka Ussen, Maina Ciira Wa, Büyük Osman

Affiliations

Department of Electrical Engineering, Pan African University Institute for Basic Sciences, Technology and Innovation, Nairobi, Kenya.

Department of Electrical Engineering, Dedan Kimathi University of Technology, Nyeri, Kenya.

Publication information

Data Brief. 2023 Nov 10;52:109796. doi: 10.1016/j.dib.2023.109796. eCollection 2024 Feb.

Abstract

Large pre-trained Automatic Speech Recognition (ASR) models have shown improved performance in low-resource languages due to the increased availability of benchmark corpora and the advantages of transfer learning. However, only a limited number of languages possess ample resources to fully leverage transfer learning. In such contexts, benchmark corpora become crucial for advancing methods. In this article, we introduce two new benchmark corpora designed for low-resource languages spoken in the Democratic Republic of the Congo: the Lingala Read Speech Corpus, with 4 h of labelled audio, and the Congolese Speech Radio Corpus, which offers 741 h of unlabelled audio spanning four significant low-resource languages of the region. For the Lingala Read Speech Corpus, thirty-two distinct adult speakers were recorded, each in a unique context, across various settings and with different accents. The raw data for the Congolese Speech Radio Corpus were taken from the archive of a broadcast station and then passed through a purpose-designed curation process. Several pre-processing strategies were applied during data preparation. The datasets, which are freely accessible to all researchers, are a valuable resource for investigating and developing both monolingual methods and approaches that employ linguistically distant languages and multilingual approaches with linguistically similar languages. Through techniques such as supervised learning and self-supervised learning, the datasets enabled the first benchmarking of speech recognition systems for Lingala and the first multilingual model tailored to four Congolese languages spoken by a combined population of 95 million. Two models were applied to the data: the first for supervised learning and the second for self-supervised pre-training.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7763/10700368/416843bd16e5/gr1.jpg
