ONUBAD: A comprehensive dataset for automated conversion of Bangla regional dialects into standard Bengali dialect.

Authors

Sultana Nusrat, Yasmin Rumana, Mallik Bijon, Uddin Mohammad Shorif

Affiliations

Department of Computer Science and Engineering, Jahangirnagar University, Dhaka, Bangladesh.

Department of Computer Science and Engineering, Bangladesh University of Business and Technology, Dhaka, Bangladesh.

Publication information

Data Brief. 2025 Jan 6;58:111276. doi: 10.1016/j.dib.2025.111276. eCollection 2025 Feb.

DOI:10.1016/j.dib.2025.111276

PMID:39895658

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11787450/

Abstract

Despite significant research on the Bangla language in Natural Language Processing (NLP), there remains a notable resource deficit for its diverse regional dialects, such as those spoken in Chittagong, Sylhet, and Barisal. These dialects, often considered unintelligible to speakers of Standard Bengali, pose challenges due to their unique grammatical structures and phonetic variations. Some linguists categorize them as distinct languages. To address this, we present ONUBAD, a large and freely available dataset for the automatic translation of Chittagong, Sylhet, and Barisal dialects into Standard Bangla using a Neural Machine Translation (NMT) system. ONUBAD provides a parallel corpus of 1540 words, 130 clauses, and 980 sentences per regional dialect and their standard counterparts along with English translation. The dataset includes metadata on phonetic variations and grammatical features, aiming to bridge the gap between standard and non-standard forms of Bangla. It serves as a valuable resource for researchers in NLP, dialect studies, and linguistic preservation, helping to develop more accurate and contextually relevant translation models. The dataset was collected between July and September 2024 from diverse sources such as books, websites, and regional people with the help of regional dialect specialists. It is hosted by the Department of Computer Science and Engineering, Jahangirnagar University, and is freely accessible at https://data.mendeley.com/datasets/6ft99kf89b/2.
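As a rough illustration of the corpus structure described in the abstract, the sketch below models one aligned record and tallies the stated counts. The field names, the dialect tag format, and the helper functions are hypothetical and are not taken from the dataset's actual schema; the totals assume the per-dialect counts apply uniformly to all three dialects.

```python
from dataclasses import dataclass

# Counts per regional dialect, as stated in the abstract.
ITEMS_PER_DIALECT = {"word": 1540, "clause": 130, "sentence": 980}
DIALECTS = ("Chittagong", "Sylhet", "Barisal")


@dataclass(frozen=True)
class ParallelEntry:
    """One aligned record. Field names are hypothetical, not the dataset's schema."""
    dialect: str        # one of DIALECTS
    unit: str           # "word", "clause", or "sentence"
    dialect_text: str   # text in the regional dialect
    standard_text: str  # Standard Bangla counterpart
    english_text: str   # English translation


def entries_per_dialect() -> int:
    """Sum of words, clauses, and sentences for a single dialect."""
    return sum(ITEMS_PER_DIALECT.values())


def total_entries() -> int:
    """Total aligned entries, assuming every dialect has the same counts."""
    return entries_per_dialect() * len(DIALECTS)


def to_nmt_pair(entry: ParallelEntry) -> tuple[str, str]:
    """Build a (source, target) pair, prefixing the source with a dialect tag.

    The tag format is an assumption made for illustration only.
    """
    return (f"<{entry.dialect}> {entry.dialect_text}", entry.standard_text)


if __name__ == "__main__":
    print(entries_per_dialect())
    print(total_entries())
```

Prefixing the source side with a dialect tag is one common way to train a single translation model over several source varieties; the abstract does not say whether the authors do this.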

Similar articles

ChatgaiyyaAlap: A dataset for conversion from Chittagonian dialect to standard Bangla.

Data Brief. 2025 Feb 21;59:111413. doi: 10.1016/j.dib.2025.111413. eCollection 2025 Apr.

BanglaTense: A large-scale dataset of Bangla sentences categorized by tense: Past, present, and future.

Data Brief. 2025 Feb 19;59:111400. doi: 10.1016/j.dib.2025.111400. eCollection 2025 Apr.

BanglaBlend: A large-scale nobel dataset of bangla sentences categorized by saint and common form of bangla language.

Data Brief. 2024 Dec 20;58:111240. doi: 10.1016/j.dib.2024.111240. eCollection 2025 Feb.

BTSD: A curated transformation of sentence dataset for text classification in Bangla language.

Data Brief. 2023 Jul 24;50:109445. doi: 10.1016/j.dib.2023.109445. eCollection 2023 Oct.

BAAD: A multipurpose dataset for automatic Bangla offensive speech recognition.

Data Brief. 2023 Mar 24;48:109067. doi: 10.1016/j.dib.2023.109067. eCollection 2023 Jun.

In the heart of Swahili: An exploration of data collection methods and corpus curation for natural language processing.

Data Brief. 2024 Jul 17;55:110751. doi: 10.1016/j.dib.2024.110751. eCollection 2024 Aug.

BanglaSER: A speech emotion recognition dataset for the Bangla language.

Data Brief. 2022 Mar 22;42:108091. doi: 10.1016/j.dib.2022.108091. eCollection 2022 Jun.

BaitBuster-Bangla: A comprehensive dataset for clickbait detection in Bangla with multi-feature and multi-modal analysis.

Data Brief. 2024 Feb 27;53:110239. doi: 10.1016/j.dib.2024.110239. eCollection 2024 Apr.

Improving neural machine translation for low resource languages through non-parallel corpora: a case study of Egyptian dialect to modern standard Arabic translation.

Sci Rep. 2024 Jan 27;14(1):2265. doi: 10.1038/s41598-023-51090-4.
