Op 't Hof Martin, Hu Ke, Tong Song, Bai Honghong
School of Artificial Intelligence, Radboud University, 6500 HE Nijmegen, The Netherlands.
Department of Psychological and Cognitive Sciences, Tsinghua University, Beijing 100084, China.
J Intell. 2025 Jul 3;13(7):80. doi: 10.3390/jintelligence13070080.
Current research predominantly involves human subjects to evaluate AI creativity. In this exploratory study, we questioned the validity of this practice and examined how creator-assessor (dis)similarity, that is, the extent to which the creator and the assessor were alike, along two dimensions of culture (Western and English-speaking vs. Eastern and Chinese-speaking) and agency (human vs. AI) influences the assessment of creativity. We first asked four types of subjects to create stories: Eastern participants (university students from China), Eastern AI (Kimi from China), Western participants (university students from The Netherlands), and Western AI (ChatGPT 3.5 from the US). Both Eastern participants and AI created stories in Chinese, which were then translated into English, while both Western participants and AI created stories in English, which were then translated into Chinese. A subset of these stories (2 creative and 2 uncreative per creator type, 16 stories in total) was then randomly selected as assessment materials. Adopting a within-subject design, we then asked new subjects from the same four types (N = 120, 30 per type) to assess these stories on creativity, originality, and appropriateness. The results confirmed that similarity along both the culture and agency dimensions influences the assessment of originality and appropriateness. On the agency dimension, human assessors preferred human-created stories for originality, while AI assessors showed no preference; conversely, AI assessors rated AI-generated stories higher in appropriateness, whereas human assessors showed no preference. On the culture dimension, both Eastern and Western assessors favored Eastern-created stories in originality, whereas for appropriateness, assessors consistently preferred stories from creators with the same cultural background. The present study is significant in raising an often-overlooked question and provides the first empirical evidence underscoring the need for further discussion on using humans to judge AI agents' creativity, or vice versa.