


Multilingual text categorization and sentiment analysis: a comparative analysis of the utilization of multilingual approaches for classifying twitter data.

Authors

Manias George, Mavrogiorgou Argyro, Kiourtis Athanasios, Symvoulidis Chrysostomos, Kyriazis Dimosthenis

Affiliations

University of Piraeus, Piraeus, Greece.

Publication

Neural Comput Appl. 2023 May 8:1-17. doi: 10.1007/s00521-023-08629-3.

DOI: 10.1007/s00521-023-08629-3
PMID: 37362579
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10165589/
Abstract

Text categorization and sentiment analysis are two of the most typical natural language processing tasks, with various emerging applications implemented and utilized in different domains, such as health care and policy making. At the same time, the tremendous growth in the popularity and usage of social media, such as Twitter, has resulted in an immense increase in user-generated data, mainly represented by the text of users' posts. However, analyzing these data and extracting actionable knowledge and added value from them is challenging, due to the domain diversity and the high degree of multilingualism that characterize them. This highlights the emerging need for domain-agnostic and multilingual solutions. To investigate a portion of these challenges, this work performs a comparative analysis of multilingual approaches for classifying both the sentiment and the text of an examined multilingual corpus. In this context, four multilingual BERT-based classifiers and a zero-shot classification approach are utilized and compared in terms of their accuracy and applicability in classifying multilingual data. The comparison has unveiled insightful outcomes with a twofold interpretation. Multilingual BERT-based classifiers achieve high performance and transfer inference when trained and fine-tuned on multilingual data. The zero-shot approach, meanwhile, presents a faster, more efficient, and more scalable way to create multilingual solutions: it can easily be fitted to new languages and new tasks while achieving relatively good results across many languages. However, when efficiency and scalability are less important than accuracy, this model, and zero-shot models in general, cannot be compared to fine-tuned and trained multilingual BERT-based classifiers.
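The zero-shot approach discussed in the abstract classifies an input by scoring it against candidate labels without any task-specific training. As a rough illustration of that idea only (the paper's multilingual BERT and zero-shot models are not reproduced here; `zero_shot_classify`, the toy label descriptions, and the bag-of-words cosine similarity are all hypothetical stand-ins for real transformer encoders), a minimal sketch:

```python
from collections import Counter
from math import sqrt

def bow(text):
    """Bag-of-words vector: a Counter over lowercase tokens."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse Counter vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def zero_shot_classify(text, label_descriptions):
    """Score `text` against a textual description of each candidate
    label and return the best match -- no task-specific training."""
    v = bow(text)
    scores = {label: cosine(v, bow(desc))
              for label, desc in label_descriptions.items()}
    return max(scores, key=scores.get), scores

# Toy label descriptions (in practice these would be label prompts
# fed to a multilingual sentence encoder or NLI model).
labels = {
    "positive": "great happy love good wonderful excellent",
    "negative": "bad terrible hate awful poor disappointing",
}
best, scores = zero_shot_classify("I love this great phone", labels)
```

Swapping in new labels or languages only requires new label descriptions, which is exactly the flexibility the abstract attributes to zero-shot models; the accuracy gap versus fine-tuned classifiers comes from how crude such untrained scoring is.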


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a30e/10165589/43e8185e4c8e/521_2023_8629_Fig7_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a30e/10165589/6b842ce3e99b/521_2023_8629_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a30e/10165589/a3a03e1a6274/521_2023_8629_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a30e/10165589/757762c0babb/521_2023_8629_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a30e/10165589/93cdd27fc332/521_2023_8629_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a30e/10165589/85dd9e343638/521_2023_8629_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a30e/10165589/4381197bfba5/521_2023_8629_Fig6_HTML.jpg

Similar Articles

1. Multilingual text categorization and sentiment analysis: a comparative analysis of the utilization of multilingual approaches for classifying twitter data.
Neural Comput Appl. 2023 May 8:1-17. doi: 10.1007/s00521-023-08629-3.
2. Vaccine sentiment analysis using BERT + NBSVM and geo-spatial approaches.
J Supercomput. 2023 May 7:1-31. doi: 10.1007/s11227-023-05319-8.
3. Multi-class sentiment analysis of Urdu text using multilingual BERT.
Sci Rep. 2022 Mar 31;12(1):5436. doi: 10.1038/s41598-022-09381-9.
4. Deep learning based sentiment analysis and offensive language identification on multilingual code-mixed data.
Sci Rep. 2022 Dec 13;12(1):21557. doi: 10.1038/s41598-022-26092-3.
5. On cross-lingual retrieval with multilingual text encoders.
Inf Retr Boston. 2022;25(2):149-183. doi: 10.1007/s10791-022-09406-x. Epub 2022 Mar 7.
6. A Comparison of ChatGPT and Fine-Tuned Open Pre-Trained Transformers (OPT) Against Widely Used Sentiment Analysis Tools: Sentiment Analysis of COVID-19 Survey Data.
JMIR Ment Health. 2024 Jan 25;11:e50150. doi: 10.2196/50150.
7. Sequence-to-sequence pretraining for a less-resourced Slovenian language.
Front Artif Intell. 2023 Mar 28;6:932519. doi: 10.3389/frai.2023.932519. eCollection 2023.
8. Transfer Learning for Sentiment Classification Using Bidirectional Encoder Representations from Transformers (BERT) Model.
Sensors (Basel). 2023 May 31;23(11):5232. doi: 10.3390/s23115232.
9. Heterogeneous text graph for comprehensive multilingual sentiment analysis: capturing short- and long-distance semantics.
PeerJ Comput Sci. 2024 Feb 23;10:e1876. doi: 10.7717/peerj-cs.1876. eCollection 2024.
10. Pashto offensive language detection: a benchmark dataset and monolingual Pashto BERT.
PeerJ Comput Sci. 2023 Oct 18;9:e1617. doi: 10.7717/peerj-cs.1617. eCollection 2023.

Cited By

1. Multilingual sentiment analysis in restaurant reviews using aspect focused learning.
Sci Rep. 2025 Aug 4;15(1):28371. doi: 10.1038/s41598-025-12464-y.
2. One size fits all: Enhanced zero-shot text classification for patient listening on social media.
Front Artif Intell. 2025 Feb 11;7:1397470. doi: 10.3389/frai.2024.1397470. eCollection 2024.
3. Evolving techniques in sentiment analysis: a comprehensive review.
PeerJ Comput Sci. 2025 Jan 28;11:e2592. doi: 10.7717/peerj-cs.2592. eCollection 2025.
4. MuTCELM: An optimal multi-TextCNN-based ensemble learning for text classification.
Heliyon. 2024 Sep 30;10(19):e38515. doi: 10.1016/j.heliyon.2024.e38515. eCollection 2024 Oct 15.
5. Heterogeneous text graph for comprehensive multilingual sentiment analysis: capturing short- and long-distance semantics.
PeerJ Comput Sci. 2024 Feb 23;10:e1876. doi: 10.7717/peerj-cs.1876. eCollection 2024.