Bengali & Banglish: A monolingual dataset for emotion detection in linguistically diverse contexts.

Author information

Faisal Moshiur Rahman, Shifa Ashrin Mobashira, Rahman Md Hasibur, Uddin Mohammed Arif, Rahaman Rashedur M

Affiliation

Department of Electrical and Computer Engineering, North South University, Dhaka-1229, Bangladesh.

Publication information

Data Brief. 2024 Jul 20;55:110760. doi: 10.1016/j.dib.2024.110760. eCollection 2024 Aug.

DOI:10.1016/j.dib.2024.110760

PMID:39183968

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11342900/

Abstract

The ever-evolving global landscape of communication, driven by advances in information technology, underscores the importance of emotion detection in natural language processing. However, interpreting emotions in linguistically diverse contexts remains challenging, notably in low-resource languages such as Bengali, and the emergence of Banglish compounds the problem. To address this gap, we present "Bengali & Banglish," an extensive dataset comprising 80,098 labelled samples across six emotion classes. The dataset fills a void in fine-grained emotion classification for Bengali and is a pioneering resource for emotion detection in Banglish. With meticulous annotation and rigorous evaluation, we obtain a weighted F1 score of 71.30% for Bengali and 64.59% for Banglish using BanglaBERT. The dataset also facilitates Bengali-to-Banglish machine translation, contributing to the advancement of language processing models. Furthermore, it attains a high Cohen's Kappa score of 93.5%, affirming the reliability and consistency of the annotations. This research underscores the importance of linguistic diversity in NLP and provides a valuable resource for enhancing emotion detection in Bengali and Banglish across digital platforms.
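
The abstract reports two metrics, a weighted F1 score and Cohen's Kappa, without spelling out how they are computed. The following is a minimal sketch of the standard definitions, not the authors' evaluation code. The function names, the toy label lists, and the emotion labels ("joy", "anger", "sadness") are illustrative assumptions; the abstract does not name the six classes.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its share of true labels."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


def cohen_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    labels = set(labels_a) | set(labels_b)
    expected = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical toy data, not drawn from the dataset.
    gold = ["joy", "joy", "anger", "sadness", "anger", "joy"]
    pred = ["joy", "anger", "anger", "sadness", "anger", "joy"]
    print(f"weighted F1: {weighted_f1(gold, pred):.4f}")
    print(f"Cohen's kappa: {cohen_kappa(gold, pred):.4f}")
```

Weighting by support matters when emotion classes are imbalanced, since a plain macro average would give a rare class the same influence as a frequent one. Kappa is reported on a 0 to 1 scale here, so the 93.5% in the abstract would correspond to 0.935.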
