An automated approach to identify sarcasm in low-resource language.

Author information

Khan Shumaila, Qasim Iqbal, Khan Wahab, Khan Aurangzeb, Ali Khan Javed, Qahmash Ayman, Ghadi Yazeed Yasin

Affiliations

Institute of CS & IT, University of Science & Technology, Bannu, Pakistan.

Department of Computer Science, School of Physics, Engineering & Computer Science, University of Hertfordshire, Hatfield, United Kingdom.

Publication information

PLoS One. 2024 Dec 5;19(12):e0307186. doi: 10.1371/journal.pone.0307186. eCollection 2024.

Abstract

Sarcasm detection has attracted growing attention because of its applicability in natural language processing (NLP), yet it remains largely unexplored in low-resource languages such as Urdu, Arabic, Pashto, and Roman-Urdu. Most existing work targets English, and few studies address sarcasm in low-resource languages. This research addresses that gap by exploring the efficacy of diverse machine learning (ML) algorithms in identifying sarcasm in Urdu. The scarcity of annotated datasets for low-resource languages is a major challenge. To overcome it, we curated and released a comparatively large dataset, the Urdu Sarcastic Tweets (UST) dataset, comprising user-generated comments from X (formerly Twitter). Automatic sarcasm detection in text uses computational methods to determine whether a given statement is intended to be sarcastic. The task is difficult because sarcasm depends on the user's behavior, attitude, and expression of emotions. We therefore employ various baseline ML classifiers and evaluate their effectiveness in detecting sarcasm in low-resource languages. The primary models evaluated in this study are support vector machine (SVM), decision tree (DT), K-Nearest Neighbor classifier (K-NN), linear regression (LR), random forest (RF), Naïve Bayes (NB), and XGBoost. We validated the performance of these ML classifiers on two distinct datasets: the Tanz-Indicator dataset and the UST dataset. The SVM classifier consistently outperformed the other ML models, with an accuracy of 0.85 across various experimental setups. This research underscores the importance of tailoring sarcasm detection approaches to the specific linguistic characteristics of low-resource languages, paving the way for future investigations. By providing open access to the UST dataset, we encourage its use as a benchmark for sarcasm detection research in similar linguistic contexts.
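To make the idea of a baseline classifier concrete, the following is a minimal sketch of one of the model families named in the abstract, Naïve Bayes, trained on bag-of-words counts and scored by accuracy. It is not the authors' pipeline: the abstract does not describe their features, preprocessing, or hyperparameters. The function names (`train_nb`, `predict_nb`, `accuracy`), the add-one smoothing, the whitespace tokenizer, and the toy English sentences standing in for labeled Urdu tweets are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model with add-alpha smoothing."""
    class_counts = Counter(labels)          # documents per class
    word_counts = defaultdict(Counter)      # token counts per class
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()       # naive whitespace tokenizer (assumption)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {
        "class_counts": class_counts,
        "word_counts": word_counts,
        "vocab": vocab,
        "alpha": alpha,
    }


def predict_nb(model, text):
    """Return the class with the highest log posterior for one document."""
    total_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    best_label, best_score = None, -math.inf
    for label, n_docs in model["class_counts"].items():
        counts = model["word_counts"][label]
        total_tokens = sum(counts.values())
        score = math.log(n_docs / total_docs)  # log prior
        for tok in text.lower().split():
            if tok not in model["vocab"]:
                continue  # ignore tokens never seen in training
            score += math.log(
                (counts[tok] + alpha) / (total_tokens + alpha * vocab_size)
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data: 1 = sarcastic, 0 = not sarcastic.
train_docs = [
    "oh great another power cut just what i needed",
    "wow love waiting two hours for the bus",
    "the match starts at seven tonight",
    "the new library opens on monday",
]
train_labels = [1, 1, 0, 0]
test_docs = ["oh great another two hours waiting", "the library opens tonight"]
test_labels = [1, 0]

model = train_nb(train_docs, train_labels)
preds = [predict_nb(model, doc) for doc in test_docs]
print("predictions:", preds)
print("accuracy:", accuracy(test_labels, preds))
```

In a real comparison, each of the other listed classifiers would be trained on the same split and ranked by the same metric. A toy set of this size says nothing about the relative ranking reported in the abstract.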
