
Generalization bias in large language model summarization of scientific research.

Author Information

Peters Uwe, Chin-Yee Benjamin

Affiliations

Utrecht University, Utrecht, The Netherlands.

Western University, London, Canada.

Publication Information

R Soc Open Sci. 2025 Apr 30;12(4):241776. doi: 10.1098/rsos.241776. eCollection 2025 Apr.

Abstract

Artificial intelligence chatbots driven by large language models (LLMs) have the potential to increase public science literacy and support scientific research, as they can quickly summarize complex scientific information in accessible terms. However, when summarizing scientific texts, LLMs may omit details that limit the scope of research conclusions, leading to generalizations of results broader than warranted by the original study. We tested 10 prominent LLMs, including ChatGPT-4o, ChatGPT-4.5, DeepSeek, LLaMA 3.3 70B, and Claude 3.7 Sonnet, comparing 4900 LLM-generated summaries to their original scientific texts. Even when explicitly prompted for accuracy, most LLMs produced broader generalizations of scientific results than those in the original texts, with DeepSeek, ChatGPT-4o, and LLaMA 3.3 70B overgeneralizing in 26-73% of cases. In a direct comparison of LLM-generated and human-authored science summaries, LLM summaries were nearly five times more likely to contain broad generalizations (odds ratio = 4.85, 95% CI [3.06, 7.70], p < 0.001). Notably, newer models tended to perform worse in generalization accuracy than earlier ones. Our results indicate a strong bias in many widely used LLMs towards overgeneralizing scientific conclusions, posing a significant risk of large-scale misinterpretations of research findings. We highlight potential mitigation strategies, including lowering LLM temperature settings and benchmarking LLMs for generalization accuracy.
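The headline statistic above (odds ratio = 4.85, 95% CI [3.06, 7.70]) comes from comparing how often broad generalizations appear in LLM versus human summaries. As a minimal sketch of how such a figure is derived, the snippet below computes an odds ratio and its Wald 95% confidence interval from a 2x2 contingency table. The counts used here are hypothetical, chosen only to illustrate the calculation; they are not the paper's data.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI from a 2x2 table.

    a = LLM summaries with a broad generalization, b = without;
    c = human summaries with a broad generalization, d = without.
    """
    or_ = (a * d) / (b * c)
    # Standard error of log(OR) under the Wald approximation
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_) - z * se)
    upper = math.exp(math.log(or_) + z * se)
    return or_, lower, upper

# Hypothetical counts for illustration only (not the study's data)
or_, lower, upper = odds_ratio_ci(120, 80, 40, 130)
print(f"OR = {or_:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

An OR near 5 with a CI excluding 1, as reported in the abstract, indicates that overgeneralization is substantially and significantly more common in the LLM-generated summaries.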


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/00d2/12042776/8dc469524798/rsos.241776.f001.jpg
