Evolutionary Optimization of Ensemble Learning to Determine Sentiment Polarity in an Unbalanced Multiclass Corpus.

Author information

García-Mendoza Consuelo V, Gambino Omar J, Villarreal-Cervantes Miguel G, Calvo Hiram

Affiliations

Escuela Superior de Cómputo, Instituto Politécnico Nacional, Mexico City 07738, Mexico.

Centro de Innovación y Desarrollo Tecnológico en Cómputo, Instituto Politécnico Nacional, Mexico City 07700, Mexico.

Publication information

Entropy (Basel). 2020 Sep 12;22(9):1020. doi: 10.3390/e22091020.

DOI:10.3390/e22091020

PMID:33286789

Original link: https://pmc.ncbi.nlm.nih.gov/articles/PMC7597113/

Abstract

Sentiment polarity classification in social media is an important task, as it makes it possible to track trends on particular subjects from a set of opinions. Great advances have recently been made with deep learning techniques such as word embeddings, recurrent neural networks, and encoders such as BERT. Unfortunately, these techniques require large amounts of data, which in some cases are not available. To model this situation, challenges such as the Spanish TASS, organized by the Spanish Society for Natural Language Processing (SEPLN), have been proposed. They pose particular difficulties. First, the split between the training and test sets is unwieldy: the test set is more than eight times the size of the training set. Second, the class distribution is markedly unbalanced and also differs between the two sets. Finally, there are four different labels, so current classification methods must be adapted for multiclass handling. Traditional machine learning methods, such as Naïve Bayes, Logistic Regression, and Support Vector Machines, achieve modest performance under these conditions, but used as an ensemble they can attain competitive performance. Several strategies for building classifier ensembles have been proposed; this paper proposes estimating an optimal weighting scheme with a Differential Evolution algorithm focused on the particular issues that multiclass classification and unbalanced corpora pose. The ensemble with the proposed optimized weighting scheme improves the classification results on the full test set of the TASS challenge (General corpus), achieving state-of-the-art performance when compared with other works on this task, which make no use of NLP techniques.
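
The idea of tuning ensemble weights with Differential Evolution can be illustrated with a small sketch. This is not the authors' implementation. It assumes one weight per classifier, a weighted soft vote over class probabilities, macro-averaged F1 as the fitness (chosen here because it treats rare classes equally), and the classic DE/rand/1/bin variant. The toy data, the simulated classifiers, and all function names and parameter values are hypothetical, and the score is measured on the same toy data the weights are tuned on.

```python
import random

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    total = 0.0
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        total += (2 * tp / denom) if denom else 0.0
    return total / n_classes

def ensemble_predict(weights, probas):
    """Weighted soft vote; probas[k][i][c] is classifier k's probability of class c for sample i."""
    n_samples, n_classes = len(probas[0]), len(probas[0][0])
    preds = []
    for i in range(n_samples):
        scores = [sum(w * p[i][c] for w, p in zip(weights, probas))
                  for c in range(n_classes)]
        preds.append(max(range(n_classes), key=scores.__getitem__))
    return preds

def differential_evolution(fitness, dim, pop_size=15, generations=30,
                           scale=0.7, cr=0.9, seed=0):
    """DE/rand/1/bin maximizing `fitness` over the box [0, 1]^dim."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    pop[0] = [0.5] * dim  # sketch choice: start one individual at equal weights
    scores = [fitness(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees at least one mutated gene
            trial = []
            for j in range(dim):
                if rng.random() < cr or j == j_rand:
                    v = pop[a][j] + scale * (pop[b][j] - pop[c][j])
                    trial.append(min(1.0, max(0.0, v)))  # clip to the box
                else:
                    trial.append(pop[i][j])
            s = fitness(trial)
            if s >= scores[i]:  # greedy selection: a slot never gets worse
                pop[i], scores[i] = trial, s
    best = max(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]

def make_toy_probas(y_true, n_classes, accuracy, rng):
    """Simulate a classifier that peaks on the true class with probability `accuracy`."""
    out = []
    for t in y_true:
        peak = t if rng.random() < accuracy else rng.randrange(n_classes)
        row = [0.4 / n_classes] * n_classes
        row[peak] += 0.6
        out.append(row)
    return out

# Hypothetical unbalanced four-class data and three simulated base classifiers.
data_rng = random.Random(42)
n_classes = 4
y_true = data_rng.choices(range(n_classes), weights=[50, 30, 15, 5], k=300)
probas = [make_toy_probas(y_true, n_classes, acc, data_rng)
          for acc in (0.45, 0.60, 0.75)]

def fitness(weights):
    return macro_f1(y_true, ensemble_predict(weights, probas), n_classes)

equal_score = fitness([0.5] * len(probas))
best_weights, best_score = differential_evolution(fitness, dim=len(probas))
print("equal weights macro-F1:", round(equal_score, 3))
print("DE weights:", [round(w, 3) for w in best_weights],
      "macro-F1:", round(best_score, 3))
```

Because one individual starts at equal weights and selection is greedy, the returned score cannot be lower than the equal-weight baseline on the data used for tuning. In practice the weights would be tuned on held-out training data and then evaluated on a separate test set.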
