HiACC: Hinglish adult & children code-switched corpus.

Authors

Singh Shruti, Singh Muskaan, Kadyan Virender

Affiliations

SoCS, University of Petroleum and Energy Studies, Dehradun, Uttarakhand, India.

SCIES, Ulster University, Northland Road, Londonderry, UK.

Publication information

Data Brief. 2025 Jul 17;62:111886. doi: 10.1016/j.dib.2025.111886. eCollection 2025 Oct.

Abstract

Code-switching, the frequent alternation between two or more languages within a single utterance, is a widespread phenomenon among bilingual and multilingual speakers. In India, more than 250 million people are estimated to engage in code-switched communication, especially blending English with Hindi (Hinglish), a group that forms one of the largest bilingual populations globally and poses a challenge for developing accurate and robust Automatic Speech Recognition (ASR) systems. Existing ASR models, typically trained on monolingual corpora, struggle with code-switched input because large, balanced, and representative datasets are lacking, particularly for diverse age groups. Recent evaluations have shown that ASR models experience a relative increase in Word Error Rate (WER) of 30-50% when exposed to code-switched speech compared to monolingual input. To address this resource gap, we introduce a benchmark Hinglish speech corpus to improve ASR performance in resource-constrained settings. While several monolingual Hindi and English corpora exist, publicly available code-switched datasets remain scarce, and none to date includes children's speech. Our corpus fills this gap by providing the first code-switched Hinglish speech dataset with recordings from both adults and children. It comprises 3,318 audio segments from adult participants and 1,858 segments from children, covering 5.24 hours of read and spontaneous speech. The transcriptions include detailed annotations and code-switching tags to assist in linguistic and computational analysis. The corpus is publicly available at https://zenodo.org/records/15551669, offering segmented audio and aligned transcripts for open research. We also present baseline ASR experiments, which show that standard models trained on monolingual data underperform by approximately 42% WER on our test set, highlighting the complexity of the task. To our knowledge, this is the first publicly available resource on code-switched Hinglish speech encompassing both adult and child speakers, designed to catalyse progress in this challenging yet important area of speech recognition.
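For readers unfamiliar with the metric, the following is a minimal sketch of how WER and a relative WER increase are commonly computed. It is not the authors' evaluation code: the function names and the example sentences are hypothetical, and the sketch assumes whitespace tokenization with no text normalization, which a real evaluation of mixed-script Hinglish transcripts would likely need.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_increase(baseline_wer: float, new_wer: float) -> float:
    """Relative change in WER with respect to a non-zero baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (new_wer - baseline_wer) / baseline_wer


# Made-up code-switched example: one substituted word out of four.
print(word_error_rate("mujhe ek coffee chahiye", "mujhe ek copy chahiye"))
```

Under this definition, a model whose WER rises from 0.20 on monolingual input to 0.28 on code-switched input shows a relative increase of about 40%, which falls inside the 30-50% range quoted above. These two WER values are illustrative only and do not come from the paper.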

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4faf/12329218/feecc33884e7/gr1.jpg
