Shadows of wisdom: Classifying meta-cognitive and morally grounded narrative content via large language models.

Affiliations

Department of Psychology, University of Waterloo, Waterloo, N2L 3G1, Canada.

Northeastern University, Boston, MA, USA.

Publication information

Behav Res Methods. 2024 Oct;56(7):7632-7646. doi: 10.3758/s13428-024-02441-0. Epub 2024 May 29.

DOI:10.3758/s13428-024-02441-0

PMID:38811519

Abstract

We investigated large language models' (LLMs) efficacy in classifying complex psychological constructs like intellectual humility, perspective-taking, open-mindedness, and search for a compromise in narratives of 347 Canadian and American adults reflecting on a workplace conflict. Using state-of-the-art models like GPT-4 across few-shot and zero-shot paradigms and RoB-ELoC (RoBERTa-fine-tuned-on-Emotion-with-Logistic-Regression-Classifier), we compared their performance with expert human coders. Results showed robust classification by LLMs, with over 80% agreement and F1 scores above 0.85, and high human-model reliability (median Cohen's κ across top models = .80). RoB-ELoC and few-shot GPT-4 were standout classifiers, although somewhat less effective in categorizing intellectual humility. We offer example workflows for easy integration into research. Our proof-of-concept findings indicate the viability of both open-source and commercial LLMs in automating the coding of complex constructs, potentially transforming social science research.
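
The three statistics reported in the abstract (percent agreement, F1, and Cohen's κ) can each be computed from a pair of label lists, one from a human coder and one from a model. The following is a minimal sketch under stated assumptions, not the authors' workflow: the function names and the ten binary labels are hypothetical and chosen only for illustration, with 1 meaning the construct was coded as present in a narrative and 0 meaning absent.

```python
from collections import Counter


def percent_agreement(human, model):
    """Share of items on which the two raters gave the same label."""
    if len(human) != len(model) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(h == m for h, m in zip(human, model)) / len(human)


def f1_score(human, model, positive=1):
    """F1 for the `positive` class, treating the human codes as ground truth."""
    pairs = list(zip(human, model))
    tp = sum(h == positive and m == positive for h, m in pairs)
    fp = sum(h != positive and m == positive for h, m in pairs)
    fn = sum(h == positive and m != positive for h, m in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cohens_kappa(human, model):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(human)
    p_o = percent_agreement(human, model)
    h_counts, m_counts = Counter(human), Counter(model)
    # Expected agreement if both raters labelled independently
    # at their own observed base rates.
    labels = set(human) | set(model)
    p_e = sum(h_counts[c] * m_counts[c] for c in labels) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Hypothetical codes for ten narratives (illustrative data only).
    human_codes = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
    model_codes = [1, 1, 0, 0, 1, 0, 0, 0, 1, 1]
    print("agreement:", percent_agreement(human_codes, model_codes))
    print("F1:", round(f1_score(human_codes, model_codes), 3))
    print("kappa:", round(cohens_kappa(human_codes, model_codes), 3))
```

Because κ subtracts the agreement expected by chance, it is lower than raw agreement whenever chance agreement is above zero, which is why the two figures are reported separately.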
