Large language models are proficient in solving and creating emotional intelligence tests.

Authors

Schlegel Katja, Sommer Nils R, Mortillaro Marcello

Affiliations

Institute of Psychology, University of Bern, Bern, Switzerland.

Institute of Psychology, Czech Academy of Sciences, Brno, Czech Republic.

Publication

Commun Psychol. 2025 May 21;3(1):80. doi: 10.1038/s44271-025-00258-x.

Abstract

Large Language Models (LLMs) demonstrate expertise across diverse domains, yet their capacity for emotional intelligence remains uncertain. This research examined whether LLMs can solve and generate performance-based emotional intelligence tests. Results showed that ChatGPT-4, ChatGPT-o1, Gemini 1.5 flash, Copilot 365, Claude 3.5 Haiku, and DeepSeek V3 outperformed humans on five standard emotional intelligence tests, achieving an average accuracy of 81%, compared to the 56% human average reported in the original validation studies. In a second step, ChatGPT-4 generated new test items for each emotional intelligence test. These new versions and the original tests were administered to human participants across five studies (total N = 467). Overall, original and ChatGPT-generated tests demonstrated statistically equivalent test difficulty. Perceived item clarity and realism, item content diversity, internal consistency, correlations with a vocabulary test, and correlations with an external ability emotional intelligence test were not statistically equivalent between original and ChatGPT-generated tests. However, all differences were smaller than Cohen's d ± 0.25, and none of the 95% confidence interval boundaries exceeded a medium effect size (d ± 0.50). Additionally, original and ChatGPT-generated tests were strongly correlated (r = 0.46). These findings suggest that LLMs can generate responses that are consistent with accurate knowledge about human emotions and their regulation.
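The equivalence results reported above rest on a specific statistical argument: the original and ChatGPT-generated tests are treated as comparable when the standardized difference between them stays inside a preset bound (Cohen's d of ±0.25, with confidence limits inside ±0.50). As a rough illustration of that logic only, and not a reproduction of the authors' analysis, the Python sketch below runs a two one-sided tests (TOST) equivalence check on simulated scores; the simulated data, the bound of d = 0.25, and the helper name tost_cohens_d are assumptions made for this example.

```python
# Illustrative only: TOST equivalence check on the standardized mean difference
# between two hypothetical sets of test scores. Data and bound are assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
original_scores = rng.normal(0.56, 0.15, 100)   # simulated accuracy on original items
generated_scores = rng.normal(0.57, 0.15, 100)  # simulated accuracy on generated items

def tost_cohens_d(x, y, bound_d=0.25):
    """Two one-sided tests: is the mean difference within +/- bound_d pooled SDs?"""
    nx, ny = len(x), len(y)
    # Pooled standard deviation and standard error of the mean difference.
    sp = np.sqrt(((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2))
    se = sp * np.sqrt(1 / nx + 1 / ny)
    diff = np.mean(x) - np.mean(y)
    delta = bound_d * sp                    # equivalence bound on the raw score scale
    dof = nx + ny - 2
    # Test 1: reject "difference <= -delta"; Test 2: reject "difference >= +delta".
    p_lower = 1 - stats.t.cdf((diff + delta) / se, dof)
    p_upper = stats.t.cdf((diff - delta) / se, dof)
    return max(p_lower, p_upper)            # equivalence if this value < alpha

print(f"TOST p-value at d = 0.25: {tost_cohens_d(original_scores, generated_scores):.3f}")
```

Equivalence is claimed only when both one-sided tests reject their null hypotheses, which is why the larger of the two p values is the decisive one.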

Similar Articles

1. Large language models are proficient in solving and creating emotional intelligence tests.
Commun Psychol. 2025 May 21;3(1):80. doi: 10.1038/s44271-025-00258-x.
5. Can large language models detect drug-drug interactions leading to adverse drug reactions?
Ther Adv Drug Saf. 2025 May 16;16:20420986251339358. doi: 10.1177/20420986251339358. eCollection 2025.
