Op 't Hof Martin, Hu Ke, Tong Song, Bai Honghong
School of Artificial Intelligence, Radboud University, 6500 HE Nijmegen, The Netherlands.
Department of Psychological and Cognitive Sciences, Tsinghua University, Beijing 100084, China.
J Intell. 2025 Jul 3;13(7):80. doi: 10.3390/jintelligence13070080.
Current research predominantly involves human subjects to evaluate AI creativity. In this exploratory study, we questioned the validity of this practice and examined how creator-assessor (dis)similarity, that is, the extent to which the creator and the assessor were alike, along two dimensions of culture (Western and English-speaking vs. Eastern and Chinese-speaking) and agency (human vs. AI) influences the assessment of creativity. We first asked four types of subjects to create stories: Eastern participants (university students from China), Eastern AI (Kimi from China), Western participants (university students from The Netherlands), and Western AI (ChatGPT 3.5 from the US). Both Eastern participants and AI created stories in Chinese, which were then translated into English, while both Western participants and AI created stories in English, which were then translated into Chinese. A subset of these stories (2 creative and 2 uncreative per creator type, 16 stories in total) was then randomly selected as assessment materials. Adopting a within-subject design, we then asked new subjects from the same four types (N = 120, 30 per type) to assess these stories on creativity, originality, and appropriateness. The results confirmed that similarity along both the culture and agency dimensions influences the assessment of originality and appropriateness. On the agency dimension, human assessors preferred human-created stories for originality, while AI assessors showed no preference; conversely, AI assessors rated AI-generated stories higher in appropriateness, whereas human assessors showed no preference. On the culture dimension, both Eastern and Western assessors favored Eastern-created stories in originality, whereas for appropriateness, assessors consistently preferred stories from creators with the same cultural background. The present study is significant in raising an often-overlooked question and provides the first empirical evidence underscoring the need for further discussion on using humans to judge AI agents' creativity, or vice versa.