Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia.
INESC TEC, Porto, Portugal.
PLoS One. 2018 Mar 13;13(3):e0194317. doi: 10.1371/journal.pone.0194317. eCollection 2018.
Social media are becoming an increasingly important source of information about the public mood regarding issues such as elections, Brexit, or the stock market. In this paper we focus on sentiment classification of Twitter data. Constructing sentiment classifiers is a standard text-mining task, but here we address the question of how to evaluate them properly, as there is no settled way to do so. Sentiment classes are ordered and unbalanced, and Twitter produces a stream of time-ordered data. The problem we address concerns the procedures used to obtain reliable estimates of performance measures, and whether the temporal ordering of the training and test data matters. We collected a large set of 1.5 million tweets in 13 European languages. We created 138 sentiment models and out-of-sample datasets, which are used as a gold standard for evaluation. The corresponding 138 in-sample datasets are used to empirically compare six estimation procedures: three variants of cross-validation and three variants of sequential validation (where the test set always follows the training set). We find no significant difference between the best cross-validation and the best sequential validation. However, we observe that all cross-validation variants tend to overestimate performance, while the sequential methods tend to underestimate it. Standard cross-validation with random selection of examples is significantly worse than blocked cross-validation and should not be used to evaluate classifiers on time-ordered data.
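The three families of procedures compared in the abstract differ in how they treat the temporal order of the tweet stream. A minimal pure-Python sketch of the distinction (illustrative only; the index-based splitters below are simplified stand-ins, not the paper's actual experimental setup):

```python
import random

def random_cv(n, k, seed=0):
    """Standard cross-validation: fold membership is random, so a
    training set can contain examples posted after the test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    size = n // k
    folds = [sorted(idx[i * size:(i + 1) * size]) for i in range(k)]
    return [(sorted(set(range(n)) - set(f)), f) for f in folds]

def blocked_cv(n, k):
    """Blocked cross-validation: folds are contiguous blocks of the
    time-ordered stream, but training blocks may still come after
    the test block."""
    size = n // k
    folds = [list(range(i * size, (i + 1) * size)) for i in range(k)]
    return [(sorted(set(range(n)) - set(f)), f) for f in folds]

def sequential_validation(n, k):
    """Sequential validation: each test block strictly follows all
    of its training data, as in a real-time deployment."""
    size = n // (k + 1)
    return [(list(range((i + 1) * size)),
             list(range((i + 1) * size, (i + 2) * size)))
            for i in range(k)]

def respects_time(splits):
    """True iff, in every split, all training examples precede
    all test examples."""
    return all(max(train) < min(test) for train, test in splits)

n, k = 100, 5
print(respects_time(random_cv(n, k)))              # False
print(respects_time(blocked_cv(n, k)))             # False
print(respects_time(sequential_validation(n, k)))  # True
```

Only sequential validation never trains on examples that postdate the test set, which is one reason the cross-validation variants can overestimate performance on streaming data.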