Schlegel Katja, Sommer Nils R, Mortillaro Marcello
Institute of Psychology, University of Bern, Bern, Switzerland.
Institute of Psychology, Czech Academy of Sciences, Brno, Czech Republic.
Commun Psychol. 2025 May 21;3(1):80. doi: 10.1038/s44271-025-00258-x.
Large Language Models (LLMs) demonstrate expertise across diverse domains, yet their capacity for emotional intelligence remains uncertain. This research examined whether LLMs can solve and generate performance-based emotional intelligence tests. Results showed that ChatGPT-4, ChatGPT-o1, Gemini 1.5 Flash, Copilot 365, Claude 3.5 Haiku, and DeepSeek V3 outperformed humans on five standard emotional intelligence tests, achieving an average accuracy of 81%, compared with the 56% human average reported in the original validation studies. In a second step, ChatGPT-4 generated new test items for each emotional intelligence test. These new versions and the original tests were administered to human participants across five studies (total N = 467). Overall, the original and ChatGPT-generated tests demonstrated statistically equivalent test difficulty. Perceived item clarity and realism, item content diversity, internal consistency, correlations with a vocabulary test, and correlations with an external ability emotional intelligence test were not statistically equivalent between the original and ChatGPT-generated tests. However, all differences were smaller than Cohen's d = ±0.25, and none of the 95% confidence interval boundaries exceeded a medium effect size (d = ±0.50). Additionally, original and ChatGPT-generated tests were strongly correlated (r = 0.46). These findings suggest that LLMs can generate responses that are consistent with accurate knowledge about human emotions and their regulation.
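For readers unfamiliar with the equivalence framing used above (bounds of d = ±0.25, with no 95% CI boundary beyond d = ±0.50), the comparison follows the logic of two one-sided tests (TOST). The sketch below is purely illustrative and is not the authors' analysis code; the `tost_equivalence` function and the simulated score vectors are assumptions introduced here to show how equivalence bounds expressed in Cohen's d units can be tested.

```python
import numpy as np
from scipy import stats

def tost_equivalence(x, y, d_bound=0.25, alpha=0.05):
    """Two one-sided tests (TOST) for equivalence of two independent means,
    with equivalence bounds expressed in Cohen's d units (+/- d_bound).
    Returns the observed Cohen's d, the TOST p value, and an equivalence flag."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = len(x), len(y)
    # Pooled standard deviation and observed Cohen's d
    sp = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2))
    d_obs = (x.mean() - y.mean()) / sp
    # Convert the d bound to a raw mean-difference bound
    delta = d_bound * sp
    se = sp * np.sqrt(1 / nx + 1 / ny)
    df = nx + ny - 2
    diff = x.mean() - y.mean()
    # Lower test: H0 diff <= -delta; upper test: H0 diff >= +delta
    t_lower = (diff + delta) / se
    t_upper = (diff - delta) / se
    p_lower = 1 - stats.t.cdf(t_lower, df)   # evidence that diff > -delta
    p_upper = stats.t.cdf(t_upper, df)       # evidence that diff < +delta
    p_tost = max(p_lower, p_upper)
    return d_obs, p_tost, p_tost < alpha

# Hypothetical example: accuracy scores on an original vs. a generated test version
rng = np.random.default_rng(0)
original = rng.normal(0.56, 0.15, 200)    # simulated, for illustration only
generated = rng.normal(0.57, 0.15, 200)
d_obs, p, equivalent = tost_equivalence(original, generated)
print(f"Cohen's d = {d_obs:.3f}, TOST p = {p:.3f}, equivalent: {equivalent}")
```

Equivalence is concluded only if both one-sided tests reject, i.e., the observed difference is statistically smaller than the upper bound and larger than the lower bound.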