Department of Economic History and International Relations, Stockholm University, Stockholm, Sweden.
PLoS One. 2023 Sep 29;18(9):e0290762. doi: 10.1371/journal.pone.0290762. eCollection 2023.
To analyse large numbers of texts, social science researchers are increasingly confronting the challenge of text classification. When manual labeling is not possible and researchers have to find automated ways to classify texts, computer science provides a useful toolbox of machine-learning methods whose performance remains understudied in the social sciences. In this article, we compare the performance of the most widely used text classifiers by applying them to a typical scenario in social science research: a relatively small labeled dataset in which the categories of interest occur infrequently, and which is part of a large unlabeled dataset. As an example case, we look at Twitter communication regarding climate change, a topic of increasing scholarly interest in interdisciplinary social science research. Using a novel dataset of 5,750 tweets from various international organizations regarding the highly ambiguous concept of climate change, we evaluate how well these methods automatically classify tweets according to whether they are about climate change. In this context, we highlight two main findings. First, supervised machine-learning methods perform better than state-of-the-art lexicons, in particular as class balance increases. Second, traditional machine-learning methods, such as logistic regression and random forest, perform similarly to sophisticated deep-learning methods, whilst requiring much less training time and computational resources. The results have important implications for the analysis of short texts in social science research.
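To make the evaluation setting concrete, the following is a minimal sketch of a lexicon baseline scored on an imbalanced toy sample. It is not the authors' code: the keyword set `CLIMATE_TERMS`, the helper names, and the six example tweets are all invented for illustration, and the study's actual lexicons and data are not reproduced here. Precision, recall, and F1 on the rare positive class are used because plain accuracy can look good on imbalanced data even when the class of interest is often missed.

```python
import re

# Hypothetical keyword lexicon (an assumption, not the lexicon used in the study).
CLIMATE_TERMS = {"climate", "warming", "emissions", "carbon"}


def lexicon_predict(text, terms=CLIMATE_TERMS):
    """Return 1 if any lexicon term occurs as a token in the text, else 0."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return int(bool(tokens & terms))


def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented toy sample: 2 of 6 tweets are about climate change (label 1).
tweets = [
    ("Global warming threatens coastal cities", 1),
    ("Rising seas put island nations at risk", 1),   # no lexicon term: missed
    ("New trade agreement signed today", 0),
    ("Join our webinar on refugee protection", 0),
    ("The political climate ahead of the summit is tense", 0),  # ambiguous term
    ("Vaccination campaign reaches new regions", 0),
]

y_true = [label for _, label in tweets]
y_pred = [lexicon_predict(text) for text, _ in tweets]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

In this toy sample the lexicon misses a relevant tweet that contains no keyword and flags an irrelevant one that uses "climate" figuratively, which is the kind of ambiguity the abstract refers to. A supervised classifier would be scored with the same metrics on a held-out portion of the labeled data.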