Kern Felix B, Wu Chien-Te, Chao Zenas C
International Research Center for Neurointelligence (WPI-IRCN), UTIAS, The University of Tokyo, Tokyo, Japan.
Br J Psychol. 2024 Jul 22. doi: 10.1111/bjop.12720.
Creativity is defined by three key factors: novelty, feasibility and value. While many creativity tests focus primarily on novelty, they often neglect feasibility and value, thereby limiting their reflection of real-world creativity. In this study, we employ GPT-4, a large language model, to assess these three dimensions in a Japanese-language Alternative Uses Test (AUT). Using a crowdsourced evaluation method, we acquire ground truth data for 30 question items and test various GPT prompt designs. Our findings show that asking for multiple responses in a single prompt, using an 'explain first, rate later' design, is both cost-effective and accurate (r = .62, .59 and .33 for novelty, feasibility and value, respectively). Moreover, our method offers comparable accuracy to existing methods in assessing novelty, without the need for training data. We also evaluate additional models such as GPT-4 Turbo, GPT-4 Omni and Claude 3.5 Sonnet. Comparable performance across these models demonstrates the universal applicability of our prompt design. Our results contribute a straightforward platform for instant AUT evaluation and provide valuable ground truth data for future methodological research.
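The abstract's 'explain first, rate later' design, with multiple responses scored in a single prompt, can be illustrated with a minimal sketch. This is not the authors' prompt or code: the prompt wording, the 1-5 rating scale, the pipe-delimited reply format, and the helper names `build_prompt`, `parse_ratings` and `pearson_r` are all assumptions made for illustration. The sketch makes no model call; it only builds the prompt text, parses a reply in the assumed format, and computes a Pearson correlation of the kind reported above between model ratings and crowdsourced ratings.

```python
import re
from statistics import mean

# The three dimensions named in the abstract.
DIMENSIONS = ("novelty", "feasibility", "value")


def build_prompt(item, responses):
    """Build ONE prompt covering several responses to a single AUT item.

    Hypothetical wording: the explanation is requested before the ratings
    ('explain first, rate later'). The 1-5 scale is an assumption.
    """
    numbered = "\n".join(f"{i}. {r}" for i, r in enumerate(responses, start=1))
    return (
        f"Object: {item}\n"
        f"Proposed alternative uses:\n{numbered}\n\n"
        "For each use, first write a one-sentence explanation, then rate it.\n"
        "Use exactly this format, one line per use:\n"
        "<number> | <explanation> | novelty=<1-5> | feasibility=<1-5> | value=<1-5>"
    )


# Matches one reply line in the assumed format above.
LINE_RE = re.compile(
    r"^\s*(\d+)\s*\|.*?\|\s*novelty=(\d)\s*\|\s*feasibility=(\d)\s*\|\s*value=(\d)\s*$"
)


def parse_ratings(reply):
    """Map response number -> {dimension: score}; malformed lines are skipped."""
    ratings = {}
    for line in reply.splitlines():
        match = LINE_RE.match(line)
        if match:
            idx, *scores = match.groups()
            ratings[int(idx)] = dict(zip(DIMENSIONS, map(int, scores)))
    return ratings


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5
```

In use, the string returned by `build_prompt` would be sent to the chosen model, the reply passed to `parse_ratings`, and `pearson_r` computed separately for each dimension against the human ratings. Batching several responses into one prompt reduces the number of requests, which is the cost argument the abstract makes.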