Zhou Lexin, Schellaert Wout, Martínez-Plumed Fernando, Moros-Daval Yael, Ferri Cèsar, Hernández-Orallo José
Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València, Valencia, Spain.
University of Cambridge, Cambridge, UK.
Nature. 2024 Oct;634(8032):61-68. doi: 10.1038/s41586-024-07930-y. Epub 2024 Sep 25.
The prevailing methods to make large language models more powerful and amenable have been based on continuous scaling up (that is, increasing their size, data volume and computational resources) and bespoke shaping up (including post-filtering, fine tuning or use of human feedback). However, larger and more instructable large language models may have become less reliable. By studying the relationship between difficulty concordance, task avoidance and prompting stability of several language model families, here we show that easy instances for human participants are also easy for the models, but scaled-up, shaped-up models do not secure areas of low difficulty in which either the model does not err or human supervision can spot the errors. We also find that early models often avoid user questions but scaled-up, shaped-up models tend to give an apparently sensible yet wrong answer much more often, including errors on difficult questions that human supervisors frequently overlook. Moreover, we observe that stability to different natural phrasings of the same question is improved by scaling-up and shaping-up interventions, but pockets of variability persist across difficulty levels. These findings highlight the need for a fundamental shift in the design and development of general-purpose artificial intelligence, particularly in high-stakes areas for which a predictable distribution of errors is paramount.