Martínez Gonzalo, Molero Juan Diego, González Sandra, Conde Javier, Brysbaert Marc, Reviriego Pedro
Universidad Carlos III de Madrid, Madrid, Spain.
ETSI de Telecomunicación, Universidad Politécnica de Madrid, Madrid, Spain.
Behav Res Methods. 2024 Dec 4;57(1):5. doi: 10.3758/s13428-024-02515-z.
This study investigates the potential of large language models (LLMs) to provide accurate estimates of concreteness, valence, and arousal for multi-word expressions. Unlike previous artificial intelligence (AI) methods, LLMs can capture the nuanced meanings of multi-word expressions. We systematically evaluated GPT-4o's ability to predict concreteness, valence, and arousal. In Study 1, GPT-4o's estimates correlated strongly with human concreteness ratings (r = .8) for multi-word expressions. In Study 2, these findings were replicated for valence and arousal ratings of individual words, with GPT-4o matching or outperforming previous AI models. Studies 3-5 extended the valence and arousal analysis to multi-word expressions and showed good validity of the LLM-generated estimates for these stimuli as well. To help researchers with stimulus selection, we provide datasets with LLM-generated norms of concreteness, valence, and arousal for 126,397 English single words and 63,680 multi-word expressions.
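The validity figures reported in the abstract are correlations between LLM-generated estimates and human ratings of the same items. A minimal sketch of that kind of comparison follows, assuming a Pearson correlation computed over the expressions present in both sets. The function name `pearson_r`, the two dictionaries, and every number in them are hypothetical illustrations; none are taken from the study's datasets or code.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical concreteness ratings (made-up values, for illustration only).
human = {
    "ice cream": 4.9,
    "traffic light": 4.8,
    "point of view": 1.9,
    "by and large": 1.3,
    "kick the bucket": 2.6,
}
llm = {
    "ice cream": 5.0,
    "traffic light": 4.7,
    "point of view": 2.2,
    "by and large": 1.5,
    "kick the bucket": 2.1,
    "red tape": 2.4,  # no human rating here, so it is excluded below
}

# Compare only the expressions rated by both sources, in a fixed order.
shared = sorted(human.keys() & llm.keys())
r = pearson_r([human[e] for e in shared], [llm[e] for e in shared])
print(f"r = {r:.2f} over {len(shared)} expressions")
```

Restricting the comparison to the shared items matters in practice, because LLM-generated norms typically cover far more expressions than any human-rated set.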