BI-SENT: bilingual aspect-based sentiment analysis of COVID-19 Tweets in Urdu language.

Author information

Hashmi Ehtesham, Altaf Amna, Anwar Muhammad Waqas, Jamal Muhammad Hasan, Bajwa Usama Ijaz

Affiliations

Department of Information Security and Communication Technology, Norwegian University of Science and Technology, Innlandet, Norway.

Department of Computer Science, COMSATS University Islamabad, Lahore, Pakistan.

Publication information

PLoS One. 2025 Jun 13;20(6):e0317562. doi: 10.1371/journal.pone.0317562. eCollection 2025.

DOI:10.1371/journal.pone.0317562

PMID:40512833

Original article link: https://pmc.ncbi.nlm.nih.gov/articles/PMC12165425/

Abstract

The COVID-19 pandemic resulted in over 600 million cases worldwide and significantly affected both physical and mental health, fostering widespread anxiety and fear. As people turned to online social networks to express their emotions, sentiment analysis became a crucial tool for understanding public sentiment. Sentiment analysis in Urdu has traditionally focused on the sentence level. Aspect-level sentiment analysis is increasingly important but remains underexplored because manual dataset annotation is costly and time-consuming. This study presents a bilingual aspect-based sentiment analysis approach for Urdu and Roman Urdu using unsupervised methods. For Urdu, a syntactic rule-based approach achieves an accuracy of 83% in extracting aspect terms, marking a 5% improvement in F1-score over existing methods. For Roman Urdu, the study employs collocation patterns and topic modeling to identify and categorize key aspects, resulting in a perplexity score of -7 and a coherence score of 41. The results demonstrate the semantic coherence of the identified categories and advance aspect-level sentiment analysis by eliminating the need for manual annotation. The study offers new insights into the sentiments expressed during the pandemic, providing valuable feedback for policymakers and health organizations.
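The abstract mentions collocation patterns for Roman Urdu but does not say which association measure was used. As a minimal sketch only, and not the authors' pipeline, the following scores adjacent word pairs by pointwise mutual information (PMI), one common way to surface collocations. The function name `bigram_pmi`, the `min_count` threshold, and the sample tweets are all hypothetical and invented for illustration.

```python
import math
from collections import Counter


def bigram_pmi(tokenized_docs, min_count=2):
    """Rank adjacent word pairs by PMI (an assumed measure, not from the paper)."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in tokenized_docs:
        unigrams.update(tokens)
        # Pair each token with its right-hand neighbour.
        bigrams.update(zip(tokens, tokens[1:]))

    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / total_bigrams
        p_w1 = unigrams[w1] / total_unigrams
        p_w2 = unigrams[w2] / total_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))

    # Highest-scoring pairs first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical Roman Urdu tweets, invented purely for illustration.
tweets = [
    "lockdown ki wajah se karobar band hai",
    "lockdown ki wajah se school band hain",
    "corona vaccine lagwana zaroori hai",
    "corona vaccine ka asar acha hai",
]
docs = [tweet.split() for tweet in tweets]
ranked = bigram_pmi(docs, min_count=2)

for pair, score in ranked:
    print(pair, round(score, 2))
```

Pairs that survive the frequency filter, such as `("corona", "vaccine")`, are candidate aspect phrases. In a fuller pipeline they might be merged into single tokens before topic modeling, though the abstract does not confirm that the study does this.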
