Department of Computer Science, Princeton University, Princeton, NJ 08542.
Department of Psychology, Princeton University, Princeton, NJ 08542.
Proc Natl Acad Sci U S A. 2024 Oct 8;121(41):e2322420121. doi: 10.1073/pnas.2322420121. Epub 2024 Oct 4.
The widespread adoption of large language models (LLMs) makes it important to recognize their strengths and limitations. We argue that to develop a holistic understanding of these systems, we must consider the problem that they were trained to solve: next-word prediction over Internet text. By recognizing the pressures that this task exerts, we can make predictions about the strategies that LLMs will adopt, allowing us to reason about when they will succeed or fail. Using this approach (which we call the teleological approach), we identify three factors that we hypothesize will influence LLM accuracy: the probability of the task to be performed, the probability of the target output, and the probability of the provided input. To test our predictions, we evaluate five LLMs (GPT-3.5, GPT-4, Claude 3, Llama 3, and Gemini 1.0) on 11 tasks, and we find robust evidence that LLMs are influenced by probability in the hypothesized ways. Many of the experiments reveal surprising failure modes. For instance, GPT-4's accuracy at decoding a simple cipher is 51% when the output is a high-probability sentence but only 13% when it is low-probability, even though this task is a deterministic one for which probability should not matter. These results show that AI practitioners should be careful about using LLMs in low-probability situations. More broadly, we conclude that we should not evaluate LLMs as if they are humans but should instead treat them as a distinct type of system, one that has been shaped by its own particular set of pressures.
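The cipher result can be made concrete with a minimal sketch. Assuming the "simple cipher" is a shift cipher such as rot13 (an assumption here; the abstract does not name the cipher), the decoding rule is a fixed character-by-character mapping, so the correct output is fully determined by the input and the plaintext's probability plays no role in the ground truth:

```python
def rot13(text: str) -> str:
    """Shift each letter 13 places (a fixed, deterministic mapping)."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            # The same shift applies regardless of how probable the
            # resulting sentence is; decoding is purely mechanical.
            out.append(chr((ord(ch) - base + 13) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

# rot13 is its own inverse: applying it twice returns the input.
print(rot13("Uryyb jbeyq"))  # -> "Hello world"
```

Because the mapping is deterministic, a system that had internalized the rule would decode high- and low-probability sentences equally well; the reported 51% vs. 13% gap is what signals sensitivity to output probability.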