Multi-label emotion classification of Urdu tweets.

Authors

Ashraf Noman, Khan Lal, Butt Sabur, Chang Hsien-Tsung, Sidorov Grigori, Gelbukh Alexander

Affiliations

CIC, Instituto Politécnico Nacional, Mexico City, Mexico.

Department of Computer Science and Information Engineering, Chang Gung University, Taoyuan, Taiwan.

Publication information

PeerJ Comput Sci. 2022 Apr 22;8:e896. doi: 10.7717/peerj-cs.896. eCollection 2022.

DOI:10.7717/peerj-cs.896

PMID:35494831

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9044368/

Abstract

Urdu is a widely used language in South Asia and worldwide. While similar datasets are available in English, we created the first multi-label emotion dataset in the Urdu Nastalíq script, consisting of 6,043 tweets and covering six basic emotions. A multi-label (ML) classification approach was adopted to detect emotions in Urdu text. The morphological and syntactic structure of Urdu makes multi-label emotion detection a challenging problem. In this paper, we build a set of baseline classifiers: machine learning algorithms (Random Forest (RF), Decision Tree (J48), Sequential Minimal Optimization (SMO), AdaBoostM1, and Bagging), deep learning algorithms (Convolutional Neural Networks (1D-CNN), Long Short-Term Memory (LSTM), and LSTM with CNN features), and a transformer-based baseline (BERT). We used a combination of text representations: stylometric features, pre-trained word embeddings, word-based n-grams, and character-based n-grams. The paper highlights the annotation guidelines, the dataset characteristics, and insights into the different methodologies used for Urdu emotion classification. We present our best results using micro-averaged F1, macro-averaged F1, accuracy, Hamming loss (HL), and exact match (EM) for all tested methods.
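The evaluation measures named in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation; it is a minimal pure-Python version of the standard definitions of Hamming loss, exact match, micro-averaged F1, and macro-averaged F1 for multi-label predictions. The toy matrices `Y_TRUE` and `Y_PRED` are invented for illustration (four hypothetical tweets, six unnamed emotion columns), and the convention that a label with no true or predicted positives scores F1 = 0 is an assumption. Accuracy is left out because the abstract does not say which multi-label definition of accuracy was used.

```python
from typing import List

Matrix = List[List[int]]  # rows = tweets, columns = binary emotion labels

# Hypothetical toy data: 4 tweets x 6 emotion labels (not from the paper).
Y_TRUE: Matrix = [
    [1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [1, 0, 0, 0, 0, 0],
]
Y_PRED: Matrix = [
    [1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0],
]


def _f1(tp: int, fp: int, fn: int) -> float:
    """F1 from counts; returns 0.0 when undefined (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def _counts(y_true: Matrix, y_pred: Matrix, label: int):
    """True positives, false positives, false negatives for one label column."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[label] == 1 and p[label] == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t[label] == 0 and p[label] == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[label] == 1 and p[label] == 0)
    return tp, fp, fn


def hamming_loss(y_true: Matrix, y_pred: Matrix) -> float:
    """Fraction of individual label slots that are predicted incorrectly."""
    wrong = sum(a != b for t, p in zip(y_true, y_pred) for a, b in zip(t, p))
    return wrong / (len(y_true) * len(y_true[0]))


def exact_match(y_true: Matrix, y_pred: Matrix) -> float:
    """Fraction of tweets whose full label set is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def micro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """F1 computed from counts pooled over all labels."""
    tp = fp = fn = 0
    for label in range(len(y_true[0])):
        a, b, c = _counts(y_true, y_pred, label)
        tp, fp, fn = tp + a, fp + b, fn + c
    return _f1(tp, fp, fn)


def macro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """Unweighted mean of the per-label F1 scores."""
    n_labels = len(y_true[0])
    return sum(_f1(*_counts(y_true, y_pred, k)) for k in range(n_labels)) / n_labels


if __name__ == "__main__":
    print("Hamming loss:", hamming_loss(Y_TRUE, Y_PRED))
    print("Exact match: ", exact_match(Y_TRUE, Y_PRED))
    print("Micro F1:    ", micro_f1(Y_TRUE, Y_PRED))
    print("Macro F1:    ", macro_f1(Y_TRUE, Y_PRED))
```

On this toy data the micro-averaged score is higher than the macro-averaged one, because macro averaging gives equal weight to labels that are rarely or never predicted correctly. That is one reason reports on imbalanced multi-label data often give both.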

