Xu Weijia, Jojic Nebojsa, Rao Sudha, Brockett Chris, Dolan Bill
Microsoft Research, Redmond, WA 98052.
Proc Natl Acad Sci U S A. 2025 Sep 2;122(35):e2504966122. doi: 10.1073/pnas.2504966122. Epub 2025 Aug 28.
With rapid advances in large language models (LLMs), there has been an increasing application of LLMs in creative content ideation and generation. A critical question emerges: can current LLMs provide ideas that are diverse enough to truly bolster collective creativity? We examine two state-of-the-art LLMs, GPT-4 and LLaMA-3, on story generation and discover that LLM-generated stories often consist of plot elements that are echoed across a number of generations. To quantify this phenomenon, we introduce the score, an automatic metric that measures the uniqueness of a plot element among alternative storylines generated using the same prompt under an LLM. Evaluating on 100 short stories, we find that LLM-generated stories often contain combinations of idiosyncratic plot elements echoed frequently across generations and across different LLMs, while plots from the original human-written stories are rarely recreated or even echoed in pieces. Moreover, our human evaluation shows that the ranking of scores among story segments correlates moderately with human judgment of surprise level, even though score computation is completely automatic without relying on human judgment.
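As an illustration of the two quantitative ideas in the abstract (a uniqueness score for a plot element relative to alternative generations from the same prompt, and a rank correlation between such scores and human surprise ratings), a minimal Python sketch follows. It is not the paper's definition of the score. The token-overlap echo test, the smoothing, the negative-log form, and all function names (`echoes`, `uniqueness_score`, `spearman`) are assumptions made here for illustration only.

```python
import math


def token_set(text):
    """Lowercased whitespace tokens of a text, as a set."""
    return set(text.lower().split())


def echoes(element, storyline, threshold=0.5):
    """Hypothetical echo test (an assumption, not the paper's method):
    the element counts as echoed when at least `threshold` of its
    tokens also appear in the alternative storyline."""
    elem = token_set(element)
    if not elem:
        return False
    overlap = len(elem & token_set(storyline)) / len(elem)
    return overlap >= threshold


def uniqueness_score(element, alternatives):
    """Illustrative score: negative log of the smoothed fraction of
    alternative storylines that echo the element. Higher values mean
    the element is rarer among the alternatives."""
    echoed = sum(echoes(element, alt) for alt in alternatives)
    rate = (echoed + 1) / (len(alternatives) + 2)  # add-one smoothing
    return -math.log(rate)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Toy alternatives standing in for storylines generated from one prompt.
alternatives = [
    "the detective finds a hidden letter in the attic",
    "a hidden letter reveals the secret",
    "she moves to paris and opens a bakery",
]
common = uniqueness_score("hidden letter", alternatives)
rare = uniqueness_score("dragon attack", alternatives)
```

In this toy setup, `spearman(scores, human_ratings)` would be applied to per-segment scores and the corresponding human surprise ratings. A real pipeline would need a far more robust echo test than token overlap, for example a model-based judgment of whether two storylines share a plot element.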