Department of Psychology, University of Waterloo, Waterloo, N2L 3G1, Canada.
Northeastern University, Boston, MA, USA.
Behav Res Methods. 2024 Oct;56(7):7632-7646. doi: 10.3758/s13428-024-02441-0. Epub 2024 May 29.
We investigated the efficacy of large language models (LLMs) in classifying complex psychological constructs such as intellectual humility, perspective-taking, open-mindedness, and search for a compromise in narratives of 347 Canadian and American adults reflecting on a workplace conflict. Using state-of-the-art models such as GPT-4 in few-shot and zero-shot paradigms, along with RoB-ELoC (RoBERTa-fine-tuned-on-Emotion-with-Logistic-Regression-Classifier), we compared model performance with that of expert human coders. Results showed robust classification by LLMs, with over 80% agreement, F1 scores above 0.85, and high human-model reliability (median Cohen's κ across top models = .80). RoB-ELoC and few-shot GPT-4 were standout classifiers, although they were somewhat less effective in categorizing intellectual humility. We offer example workflows for easy integration into research. Our proof-of-concept findings indicate the viability of both open-source and commercial LLMs in automating the coding of complex constructs, potentially transforming social science research.
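The three statistics reported above (percent agreement, F1, and Cohen's κ) can each be computed from a pair of aligned code vectors, one from a human coder and one from a model. The following is a minimal sketch, not the authors' workflow: it assumes binary codes (1 = construct present, 0 = absent), and the function names and the ten example codes are hypothetical illustrations rather than data from the study.

```python
from collections import Counter


def percent_agreement(human, model):
    """Share of narratives on which the human and model codes match."""
    if len(human) != len(model) or not human:
        raise ValueError("code vectors must be non-empty and of equal length")
    return sum(h == m for h, m in zip(human, model)) / len(human)


def f1_score(human, model, positive=1):
    """F1 for the positive class, treating the human codes as ground truth."""
    tp = sum(h == positive and m == positive for h, m in zip(human, model))
    fp = sum(h != positive and m == positive for h, m in zip(human, model))
    fn = sum(h == positive and m != positive for h, m in zip(human, model))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cohens_kappa(human, model):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(human)
    p_o = percent_agreement(human, model)
    h_counts, m_counts = Counter(human), Counter(model)
    # Expected agreement if both coders assigned labels independently
    # at their observed base rates.
    labels = set(human) | set(model)
    p_e = sum(h_counts[c] * m_counts[c] for c in labels) / n**2
    if p_e == 1.0:
        # Both coders used a single identical label; kappa is undefined,
        # so return 1.0 by convention (an assumption of this sketch).
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical codes for ten narratives (illustration only, not study data).
human_codes = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
model_codes = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]

print(f"agreement = {percent_agreement(human_codes, model_codes):.2f}")
print(f"F1        = {f1_score(human_codes, model_codes):.2f}")
print(f"kappa     = {cohens_kappa(human_codes, model_codes):.2f}")
```

Reporting κ alongside raw agreement matters because agreement alone can look high when one code dominates; κ discounts the matches that two independent coders would produce by chance.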