Embers of autoregression show how large language models are shaped by the problem they are trained to solve.

Affiliations

Department of Computer Science, Princeton University, Princeton, NJ 08542.

Department of Psychology, Princeton University, Princeton, NJ 08542.

Publication Information

Proc Natl Acad Sci U S A. 2024 Oct 8;121(41):e2322420121. doi: 10.1073/pnas.2322420121. Epub 2024 Oct 4.

DOI: 10.1073/pnas.2322420121
PMID: 39365822
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11474099/
Abstract

The widespread adoption of large language models (LLMs) makes it important to recognize their strengths and limitations. We argue that to develop a holistic understanding of these systems, we must consider the problem that they were trained to solve: next-word prediction over Internet text. By recognizing the pressures that this task exerts, we can make predictions about the strategies that LLMs will adopt, allowing us to reason about when they will succeed or fail. Using this approach, which we call the teleological approach, we identify three factors that we hypothesize will influence LLM accuracy: the probability of the task to be performed, the probability of the target output, and the probability of the provided input. To test our predictions, we evaluate five LLMs (GPT-3.5, GPT-4, Claude 3, Llama 3, and Gemini 1.0) on 11 tasks, and we find robust evidence that LLMs are influenced by probability in the hypothesized ways. Many of the experiments reveal surprising failure modes. For instance, GPT-4's accuracy at decoding a simple cipher is 51% when the output is a high-probability sentence but only 13% when it is low-probability, even though this task is a deterministic one for which probability should not matter. These results show that AI practitioners should be careful about using LLMs in low-probability situations. More broadly, we conclude that we should not evaluate LLMs as if they are humans but should instead treat them as a distinct type of system, one that has been shaped by its own particular set of pressures.
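The cipher result is striking because the task is fully deterministic: every ciphertext maps to exactly one plaintext, so output probability should be irrelevant. As an illustrative sketch of the kind of task involved, assuming a rot-13 shift cipher (the example sentence is hypothetical, not one of the paper's stimuli):

```python
# A shift cipher is a bijection on letters: decoding is exact and
# independent of how "likely" the decoded sentence is as English text.

def shift(text: str, k: int) -> str:
    """Shift each letter by k positions, preserving case; leave other characters alone."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            out.append(chr((ord(ch) - base + k) % 26 + base))
        else:
            out.append(ch)
    return ''.join(out)

def rot13_encode(text: str) -> str:
    return shift(text, 13)

def rot13_decode(text: str) -> str:
    # Shifting by -13 is equivalent to +13 for rot-13, since 13 + 13 = 26.
    return shift(text, -13)

sentence = "The quick brown fox jumps over the lazy dog."
assert rot13_decode(rot13_encode(sentence)) == sentence
```

Because encoding and decoding are exact inverses, a system executing the algorithm scores 100% regardless of the plaintext's probability; the reported 51% vs. 13% gap therefore reflects the model's probabilistic priors over outputs rather than any difference in task difficulty.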


Figures (PMC):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f74b/11474099/621192a3515a/pnas.2322420121fig01.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f74b/11474099/dd557147de9d/pnas.2322420121fig02.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f74b/11474099/c261c6cb3319/pnas.2322420121fig03.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f74b/11474099/8c833dadebb8/pnas.2322420121fig04.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f74b/11474099/c71465ff4db3/pnas.2322420121fig05.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f74b/11474099/420dbfd3616f/pnas.2322420121fig06.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f74b/11474099/90dfe7e94858/pnas.2322420121fig07.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f74b/11474099/282d424d6da7/pnas.2322420121fig08.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f74b/11474099/07fa368b249c/pnas.2322420121fig09.jpg

Similar Articles

1. Embers of autoregression show how large language models are shaped by the problem they are trained to solve.
Proc Natl Acad Sci U S A. 2024 Oct 8;121(41):e2322420121. doi: 10.1073/pnas.2322420121. Epub 2024 Oct 4.
2. Quality of Answers of Generative Large Language Models Versus Peer Users for Interpreting Laboratory Test Results for Lay Patients: Evaluation Study.
J Med Internet Res. 2024 Apr 17;26:e56655. doi: 10.2196/56655.
3. Quality of Answers of Generative Large Language Models vs Peer Patients for Interpreting Lab Test Results for Lay Patients: Evaluation Study.
ArXiv. 2024 Jan 23:arXiv:2402.01693v1.
4. Evaluating the Capabilities of Generative AI Tools in Understanding Medical Papers: Qualitative Study.
JMIR Med Inform. 2024 Sep 4;12:e59258. doi: 10.2196/59258.
5. Can large language models replace humans in systematic reviews? Evaluating GPT-4's efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages.
Res Synth Methods. 2024 Jul;15(4):616-626. doi: 10.1002/jrsm.1715. Epub 2024 Mar 14.
6. Comparing the Performance of Popular Large Language Models on the National Board of Medical Examiners Sample Questions.
Cureus. 2024 Mar 11;16(3):e55991. doi: 10.7759/cureus.55991. eCollection 2024 Mar.
7. An Empirical Evaluation of Prompting Strategies for Large Language Models in Zero-Shot Clinical Natural Language Processing: Algorithm Development and Validation Study.
JMIR Med Inform. 2024 Apr 8;12:e55318. doi: 10.2196/55318.
8. Triage Performance Across Large Language Models, ChatGPT, and Untrained Doctors in Emergency Medicine: Comparative Study.
J Med Internet Res. 2024 Jun 14;26:e53297. doi: 10.2196/53297.
9. Learning to Make Rare and Complex Diagnoses With Generative AI Assistance: Qualitative Study of Popular Large Language Models.
JMIR Med Educ. 2024 Feb 13;10:e51391. doi: 10.2196/51391.
10. A comprehensive evaluation of large language models on benchmark biomedical text processing tasks.
Comput Biol Med. 2024 Mar;171:108189. doi: 10.1016/j.compbiomed.2024.108189. Epub 2024 Feb 20.

Cited By

1. A framework for evaluating the chemical knowledge and reasoning abilities of large language models against the expertise of chemists.
Nat Chem. 2025 May 20. doi: 10.1038/s41557-025-01815-x.
2. Specialized Large Language Model Outperforms Neurologists at Complex Diagnosis in Blinded Case-Based Evaluation.
Brain Sci. 2025 Mar 27;15(4):347. doi: 10.3390/brainsci15040347.
3. How to evaluate the cognitive abilities of LLMs.
Nat Hum Behav. 2025 Feb;9(2):230-233. doi: 10.1038/s41562-024-02096-z.
4. Testing AI on language comprehension tasks reveals insensitivity to underlying meaning.
Sci Rep. 2024 Nov 14;14(1):28083. doi: 10.1038/s41598-024-79531-8.
5. Untrained neural networks can demonstrate memorization-independent abstract reasoning.
Sci Rep. 2024 Nov 8;14(1):27249. doi: 10.1038/s41598-024-78530-z.

References

1. Dissociating language and thought in large language models.
Trends Cogn Sci. 2024 Jun;28(6):517-540. doi: 10.1016/j.tics.2024.01.011. Epub 2024 Mar 19.
2. Rational Simplification and Rigidity in Human Planning.
Psychol Sci. 2023 Nov;34(11):1281-1292. doi: 10.1177/09567976231200547. Epub 2023 Oct 25.
3. Emergent analogical reasoning in large language models.
Nat Hum Behav. 2023 Sep;7(9):1526-1541. doi: 10.1038/s41562-023-01659-w. Epub 2023 Jul 31.
4. How do we know how smart AI systems are?
Science. 2023 Jul 14;381(6654):adj5957. doi: 10.1126/science.adj5957. Epub 2023 Jul 13.
5. Holistic Evaluation of Language Models.
Ann N Y Acad Sci. 2023 Jul;1525(1):140-146. doi: 10.1111/nyas.15007. Epub 2023 May 25.
6. The debate over understanding in AI's large language models.
Proc Natl Acad Sci U S A. 2023 Mar 28;120(13):e2215907120. doi: 10.1073/pnas.2215907120. Epub 2023 Mar 21.
7. Probing the psychology of AI models.
Proc Natl Acad Sci U S A. 2023 Mar 7;120(10):e2300963120. doi: 10.1073/pnas.2300963120. Epub 2023 Mar 1.
8. Using cognitive psychology to understand GPT-3.
Proc Natl Acad Sci U S A. 2023 Feb 7;120(6):e2218523120. doi: 10.1073/pnas.2218523120. Epub 2023 Feb 2.
9. Can You Hear Me Now? Sensitive Comparisons of Human and Machine Perception.
Cogn Sci. 2022 Oct;46(10):e13191. doi: 10.1111/cogs.13191.
10. The neural architecture of language: Integrative modeling converges on predictive processing.
Proc Natl Acad Sci U S A. 2021 Nov 9;118(45). doi: 10.1073/pnas.2105646118.