Sultana Nusrat, Yasmin Rumana, Mallik Bijon, Uddin Mohammad Shorif
Department of Computer Science and Engineering, Jahangirnagar University, Dhaka, Bangladesh.
Department of Computer Science and Engineering, Bangladesh University of Business and Technology, Dhaka, Bangladesh.
Data Brief. 2025 Jan 6;58:111276. doi: 10.1016/j.dib.2025.111276. eCollection 2025 Feb.
Despite significant research on the Bangla language in Natural Language Processing (NLP), there remains a notable resource deficit for its diverse regional dialects, such as those spoken in Chittagong, Sylhet, and Barisal. These dialects, often considered unintelligible to speakers of Standard Bangla, pose challenges due to their unique grammatical structures and phonetic variations, and some linguists categorize them as distinct languages. To address this gap, we present ONUBAD, a large and freely available dataset for the automatic translation of the Chittagong, Sylhet, and Barisal dialects into Standard Bangla using a Neural Machine Translation (NMT) system. For each regional dialect, ONUBAD provides a parallel corpus of 1540 words, 130 clauses, and 980 sentences, each paired with its Standard Bangla counterpart and an English translation. The dataset includes metadata on phonetic variations and grammatical features, aiming to bridge the gap between standard and non-standard forms of Bangla. It serves as a valuable resource for researchers in NLP, dialect studies, and linguistic preservation, and supports the development of more accurate and contextually relevant translation models. The dataset was collected between July and September 2024 from diverse sources, including books, websites, and local residents, with the help of regional dialect specialists. It is hosted by the Department of Computer Science and Engineering, Jahangirnagar University, and is freely accessible at https://data.mendeley.com/datasets/6ft99kf89b/2.