Ashraf Noman, Khan Lal, Butt Sabur, Chang Hsien-Tsung, Sidorov Grigori, Gelbukh Alexander
CIC, Instituto Politécnico Nacional, Mexico City, Mexico.
Department of Computer Science and Information Engineering, Chang Gung University, Taoyuan, Taiwan.
PeerJ Comput Sci. 2022 Apr 22;8:e896. doi: 10.7717/peerj-cs.896. eCollection 2022.
Urdu is widely spoken in South Asia and around the world. Although comparable datasets exist for English, we created the first multi-label emotion dataset for Urdu, consisting of 6,043 tweets written in the Urdu Nastalíq script and annotated with six basic emotions. We adopted a multi-label (ML) classification approach to detect emotions in Urdu text. The morphological and syntactic structure of Urdu makes multi-label emotion detection a challenging problem. In this paper, we build a set of baseline classifiers comprising machine-learning algorithms (Random Forest (RF), Decision Tree (J48), Sequential Minimal Optimization (SMO), AdaBoostM1, and Bagging), deep-learning algorithms (one-dimensional Convolutional Neural Networks (1D-CNN), Long Short-Term Memory (LSTM), and LSTM with CNN features), and a transformer-based baseline (BERT). We used a combination of text representations: stylometric features, pre-trained word embeddings, word-based n-grams, and character-based n-grams. The paper describes the annotation guidelines and dataset characteristics and offers insights into the different methodologies used for Urdu-based emotion classification. We report our best results for all tested methods using micro-averaged F1, macro-averaged F1, accuracy, Hamming loss (HL), and exact match (EM).
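As an illustration of how four of the reported multi-label metrics (micro-averaged F1, macro-averaged F1, Hamming loss, and exact match) are commonly defined, the following minimal Python sketch computes them on a toy example. It is not the authors' evaluation code: the function names, the three-label setup, and the toy label matrices are assumptions made for illustration only, and accuracy is omitted because its multi-label definition varies between papers.

```python
from typing import List

Matrix = List[List[int]]  # one row per tweet, one 0/1 column per emotion label


def hamming_loss(y_true: Matrix, y_pred: Matrix) -> float:
    """Fraction of individual label decisions that are wrong."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr))
    return wrong / total


def exact_match(y_true: Matrix, y_pred: Matrix) -> float:
    """Fraction of tweets whose full predicted label set equals the gold set."""
    hits = sum(list(tr) == list(pr) for tr, pr in zip(y_true, y_pred))
    return hits / len(y_true)


def _counts(y_true: Matrix, y_pred: Matrix, j: int):
    """True positives, false positives, false negatives for label column j."""
    tp = sum(tr[j] == 1 and pr[j] == 1 for tr, pr in zip(y_true, y_pred))
    fp = sum(tr[j] == 0 and pr[j] == 1 for tr, pr in zip(y_true, y_pred))
    fn = sum(tr[j] == 1 and pr[j] == 0 for tr, pr in zip(y_true, y_pred))
    return tp, fp, fn


def _f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """Pool counts over all labels, then compute a single F1."""
    n_labels = len(y_true[0])
    per_label = [_counts(y_true, y_pred, j) for j in range(n_labels)]
    tp, fp, fn = (sum(c[i] for c in per_label) for i in range(3))
    return _f1(tp, fp, fn)


def macro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """Compute F1 per label, then take the unweighted mean."""
    n_labels = len(y_true[0])
    return sum(_f1(*_counts(y_true, y_pred, j)) for j in range(n_labels)) / n_labels


# Toy data (hypothetical): 4 tweets, 3 emotion labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]

print("Hamming loss:", hamming_loss(y_true, y_pred))
print("Exact match: ", exact_match(y_true, y_pred))
print("Micro F1:    ", round(micro_f1(y_true, y_pred), 4))
print("Macro F1:    ", round(macro_f1(y_true, y_pred), 4))
```

Micro-averaging pools decisions across labels, so frequent emotions dominate the score, whereas macro-averaging weights every emotion equally; reporting both, as the abstract does, shows whether performance on rarer labels lags behind.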