ONUBAD: A comprehensive dataset for automated conversion of Bangla regional dialects into standard Bengali dialect.

Author information

Sultana Nusrat, Yasmin Rumana, Mallik Bijon, Uddin Mohammad Shorif

Affiliations

Department of Computer Science and Engineering, Jahangirnagar University, Dhaka, Bangladesh.

Department of Computer Science and Engineering, Bangladesh University of Business and Technology, Dhaka, Bangladesh.

Publication information

Data Brief. 2025 Jan 6;58:111276. doi: 10.1016/j.dib.2025.111276. eCollection 2025 Feb.

Abstract

Despite significant research on the Bangla language in Natural Language Processing (NLP), there remains a notable shortage of resources for its diverse regional dialects, such as those spoken in Chittagong, Sylhet, and Barisal. These dialects, often considered unintelligible to speakers of Standard Bangla, pose challenges because of their distinct grammatical structures and phonetic variations; some linguists categorize them as separate languages. To address this, we present ONUBAD, a large and freely available dataset for the automatic translation of the Chittagong, Sylhet, and Barisal dialects into Standard Bangla using a Neural Machine Translation (NMT) system. For each regional dialect, ONUBAD provides a parallel corpus of 1540 words, 130 clauses, and 980 sentences, paired with their Standard Bangla counterparts and English translations. The dataset includes metadata on phonetic variations and grammatical features, aiming to bridge the gap between standard and non-standard forms of Bangla. It serves as a resource for researchers in NLP, dialect studies, and linguistic preservation, supporting the development of more accurate and contextually relevant translation models. The dataset was collected between July and September 2024 from diverse sources such as books, websites, and local residents, with the help of regional dialect specialists. It is hosted by the Department of Computer Science and Engineering, Jahangirnagar University, and is freely accessible at https://data.mendeley.com/datasets/6ft99kf89b/2.
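The parallel structure described in the abstract (each dialect entry aligned with a Standard Bangla counterpart and an English translation) could be turned into source/target pairs for NMT training roughly as follows. This is only a sketch: the column names, the CSV layout, and the placeholder strings are assumptions made for illustration and are not taken from the published files, which may use a different format and different headers.

```python
import csv
import io
from typing import Iterable, Iterator, Mapping, NamedTuple


class ParallelPair(NamedTuple):
    dialect: str  # name of the regional dialect
    source: str   # text in the regional dialect
    target: str   # aligned Standard Bangla text


# Hypothetical column names; check the real headers in the downloaded files.
DIALECT_COLUMNS = ("Chittagong", "Sylhet", "Barisal")
STANDARD_COLUMN = "Standard"


def iter_pairs(rows: Iterable[Mapping[str, str]]) -> Iterator[ParallelPair]:
    """Yield one (dialect, source, target) pair per non-empty dialect cell."""
    for row in rows:
        target = (row.get(STANDARD_COLUMN) or "").strip()
        if not target:
            continue  # nothing to align against
        for dialect in DIALECT_COLUMNS:
            source = (row.get(dialect) or "").strip()
            if source:
                yield ParallelPair(dialect, source, target)


# Placeholder rows standing in for real entries (not actual dataset content).
SAMPLE = """Chittagong,Sylhet,Barisal,Standard,English
<ctg-1>,<syl-1>,<bar-1>,<std-1>,<en-1>
<ctg-2>,,<bar-2>,<std-2>,<en-2>
"""

pairs = list(iter_pairs(csv.DictReader(io.StringIO(SAMPLE))))
for pair in pairs:
    print(pair.dialect, pair.source, "->", pair.target)
```

Keeping the dialect name on every pair makes it possible to train either one model per dialect or a single multi-dialect model that receives the dialect as a tag on the source side.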
