Faisal Moshiur Rahman, Shifa Ashrin Mobashira, Rahman Md Hasibur, Uddin Mohammed Arif, Rahaman Rashedur M
Department of Electrical and Computer Engineering, North South University, Dhaka-1229, Bangladesh.
Data Brief. 2024 Jul 20;55:110760. doi: 10.1016/j.dib.2024.110760. eCollection 2024 Aug.
Advances in information technology continue to reshape global communication, making emotion detection an increasingly important task in natural language processing. However, interpreting emotions remains difficult in linguistically diverse contexts, notably in low-resource languages like Bengali, and the difficulty is compounded by the emergence of Banglish. To address this gap, we present "Bengali & Banglish," an extensive dataset comprising 80,098 labelled samples across six emotion classes. The dataset fills a void in fine-grained emotion classification for Bengali and pioneers emotion detection in Banglish. Using BanglaBERT, we achieve a weighted F1 score of 71.30% for Bengali and 64.59% for Banglish. The dataset also facilitates Bengali-to-Banglish machine translation, contributing to the advancement of language processing models. Its annotations reach a Cohen's Kappa score of 93.5%, affirming their reliability and consistency. This research underscores the importance of linguistic diversity in NLP and provides a valuable resource for enhancing emotion detection in Bengali and Banglish across digital platforms.
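The two metrics quoted in the abstract, weighted F1 and Cohen's Kappa, can be illustrated with a minimal sketch. It uses the standard textbook definitions and is not the authors' evaluation code. The function names and the toy labels ("joy", "anger", "sad") are hypothetical, since the abstract does not name the six emotion classes.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label in support:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * support[label] / total
    return score


def cohens_kappa(rater_a, rater_b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy data with hypothetical labels, not taken from the dataset.
gold = ["joy", "joy", "anger", "sad"]
pred = ["joy", "anger", "anger", "sad"]
print(f"weighted F1:   {weighted_f1(gold, pred):.4f}")
print(f"Cohen's Kappa: {cohens_kappa(gold, pred):.4f}")
```

Both functions return values between 0 and 1 in the typical case (Kappa can be negative when agreement is below chance), so the percentages in the abstract correspond to these values multiplied by 100.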