Department of Sociology/ICS, Utrecht University, Utrecht, The Netherlands.
Behav Res Methods. 2024 Apr;56(4):2782-2803. doi: 10.3758/s13428-024-02381-9. Epub 2024 Apr 4.
Short texts generated by individuals in online environments can provide social and behavioral scientists with rich insights into these individuals' internal states. Trained manual coders can reliably interpret expressions of such internal states in text. However, manual coding restricts the number of texts that can be analyzed, limiting our ability to extract insights from large-scale textual data. We evaluate the performance of several automatic text analysis methods in approximating trained human coders' evaluations across four coding tasks encompassing expressions of motives, norms, emotions, and stances. Our findings suggest that commonly used dictionaries, although they perform well at identifying infrequent categories, generate false positives far more often than the other methods. We show that large language models trained on manually coded data yield the highest performance across all case studies, although in some instances simpler methods perform almost as well. Additionally, we evaluate the effectiveness of cutting-edge generative language models such as GPT-4 in coding texts for internal states using only short instructions (so-called zero-shot classification). While promising, these models fall short of the performance of models trained on manually analyzed data. We discuss the strengths and weaknesses of the various models and explore the trade-offs between model complexity and performance in different applications. Our work informs social and behavioral scientists of the challenges associated with text mining of large textual datasets, while providing best-practice recommendations.
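The dictionary approach the abstract evaluates can be illustrated with a minimal sketch. This is not code from the study; the category names and word lists are invented for illustration, but the mechanism (flagging a category whenever any of its words appears) is what makes such methods prone to false positives on context-dependent language.

```python
# Illustrative sketch of a word-list (dictionary) coder of the kind the
# abstract compares against trained models. Categories and word lists
# here are hypothetical examples, not the study's actual dictionaries.
EMOTION_DICT = {
    "anger": {"furious", "outraged", "hate"},
    "joy": {"happy", "delighted", "love"},
}

def dictionary_code(text):
    """Return every category whose word list shares a token with the text."""
    tokens = set(text.lower().split())
    return {cat for cat, words in EMOTION_DICT.items() if tokens & words}

# A single matching word fires the category regardless of context,
# which is one source of the false positives the abstract reports:
print(dictionary_code("i would love a quiet evening"))  # fires "joy"
```

A trained classifier, by contrast, can weigh the surrounding context of a word before assigning a category, which is one reason the abstract finds models trained on manually coded data more accurate.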