


Evaluating language models for mathematics through interactions.

Affiliations

University of Cambridge, Cambridge CB2 1TN, United Kingdom.

University of Oxford, Oxford OX1 4BH, United Kingdom.

Publication Information

Proc Natl Acad Sci U S A. 2024 Jun 11;121(24):e2318124121. doi: 10.1073/pnas.2318124121. Epub 2024 Jun 3.

DOI: 10.1073/pnas.2318124121
PMID: 38830100
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11181017/
Abstract

There is much excitement about the opportunity to harness the power of large language models (LLMs) when building problem-solving assistants. However, the standard methodology of evaluating LLMs relies on static pairs of inputs and outputs; this is insufficient for making an informed decision about which LLMs are best to use in an interactive setting, and how that varies by setting. Static assessment therefore limits how we understand language model capabilities. We introduce CheckMate, an adaptable prototype platform for humans to interact with and evaluate LLMs. We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics, with a mixed cohort of participants from undergraduate students to professors of mathematics. We release the resulting interaction and rating dataset, MathConverse. By analyzing MathConverse, we derive a taxonomy of human query behaviors and uncover that despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness in LLM generations, among other findings. Further, we garner a more granular understanding of GPT-4 mathematical problem-solving through a series of case studies, contributed by experienced mathematicians. We conclude with actionable takeaways for ML practitioners and mathematicians: models that communicate uncertainty, respond well to user corrections, and can provide a concise rationale for their recommendations, may constitute better assistants. Humans should inspect LLM output carefully given their current shortcomings and potential for surprising fallibility.
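The interactive setup the abstract describes, in which participants query a model freely and rate each response separately for correctness and perceived helpfulness, can be sketched as a minimal evaluation loop. Everything below (names, fields, rating scales) is an illustrative assumption, not CheckMate's actual implementation:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical sketch of a CheckMate-style interaction record. The two
# rating axes mirror the abstract's distinction between correctness and
# perceived helpfulness; the integer scales are assumed, not the paper's.

@dataclass
class Turn:
    user_query: str
    model_response: str
    correctness: int   # participant's rating of mathematical correctness
    helpfulness: int   # participant's rating of perceived helpfulness

@dataclass
class Interaction:
    problem: str
    model_name: str
    turns: List[Turn] = field(default_factory=list)

def run_session(problem: str, model_name: str,
                ask_model: Callable[[str], str],
                get_query: Callable[[], str],
                rate: Callable[[str], Tuple[int, int]]) -> Interaction:
    """Drive one human-model session: the participant issues free-form
    queries, the model answers, and every answer is rated on both axes.
    An empty query ends the session."""
    session = Interaction(problem, model_name)
    while (query := get_query()):
        response = ask_model(query)
        correct, helpful = rate(response)
        session.turns.append(Turn(query, response, correct, helpful))
    return session
```

Collecting many such `Interaction` records across participants and models would yield a dataset in the spirit of MathConverse, with correctness and helpfulness ratings that can then be correlated or contrasted per model.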


Figures:
Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6721/11181017/0883ea5ef413/pnas.2318124121fig01.jpg
Fig. 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6721/11181017/281589f5e5e1/pnas.2318124121fig02.jpg
Fig. 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6721/11181017/3663921b1265/pnas.2318124121fig03.jpg

Similar Articles

1. Evaluating language models for mathematics through interactions.
Proc Natl Acad Sci U S A. 2024 Jun 11;121(24):e2318124121. doi: 10.1073/pnas.2318124121. Epub 2024 Jun 3.
2. Quality of Answers of Generative Large Language Models Versus Peer Users for Interpreting Laboratory Test Results for Lay Patients: Evaluation Study.
J Med Internet Res. 2024 Apr 17;26:e56655. doi: 10.2196/56655.
3. Learning to Make Rare and Complex Diagnoses With Generative AI Assistance: Qualitative Study of Popular Large Language Models.
JMIR Med Educ. 2024 Feb 13;10:e51391. doi: 10.2196/51391.
4. Quality of Answers of Generative Large Language Models vs Peer Patients for Interpreting Lab Test Results for Lay Patients: Evaluation Study.
ArXiv. 2024 Jan 23:arXiv:2402.01693v1.
5. Large Language Models for Therapy Recommendations Across 3 Clinical Specialties: Comparative Study.
J Med Internet Res. 2023 Oct 30;25:e49324. doi: 10.2196/49324.
6. Evaluation of the Performance of Generative AI Large Language Models ChatGPT, Google Bard, and Microsoft Bing Chat in Supporting Evidence-Based Dentistry: Comparative Mixed Methods Study.
J Med Internet Res. 2023 Dec 28;25:e51580. doi: 10.2196/51580.
7. Assessing the Alignment of Large Language Models With Human Values for Mental Health Integration: Cross-Sectional Study Using Schwartz's Theory of Basic Values.
JMIR Ment Health. 2024 Apr 9;11:e55988. doi: 10.2196/55988.
8. Use of Large Language Models to Predict Neuroimaging.
J Am Coll Radiol. 2023 Oct;20(10):1004-1009. doi: 10.1016/j.jacr.2023.06.008. Epub 2023 Jul 8.
9. The Role of Large Language Models in Transforming Emergency Medicine: Scoping Review.
JMIR Med Inform. 2024 May 10;12:e53787. doi: 10.2196/53787.
10. Current safeguards, risk mitigation, and transparency measures of large language models against the generation of health disinformation: repeated cross sectional analysis.
BMJ. 2024 Mar 20;384:e078538. doi: 10.1136/bmj-2023-078538.

Cited By

1. The dawn of a new era: can machine learning and large language models reshape QSP modeling?
J Pharmacokinet Pharmacodyn. 2025 Jun 16;52(4):36. doi: 10.1007/s10928-025-09984-5.
2. Comparing AI and human decision-making mechanisms in daily collaborative experiments.
iScience. 2025 May 21;28(6):112711. doi: 10.1016/j.isci.2025.112711. eCollection 2025 Jun 20.
3. Artificial intelligence in healthcare education: evaluating the accuracy of ChatGPT, Copilot, and Google Gemini in cardiovascular pharmacology.
Front Med (Lausanne). 2025 Feb 19;12:1495378. doi: 10.3389/fmed.2025.1495378. eCollection 2025.
4. Accessible interactive learning of mathematical expressions for school students with visual disabilities.
PeerJ Comput Sci. 2024 Dec 23;10:e2599. doi: 10.7717/peerj-cs.2599. eCollection 2024.
5. Should Artificial Intelligence Play a Durable Role in Biomedical Research and Practice?
Int J Mol Sci. 2024 Dec 13;25(24):13371. doi: 10.3390/ijms252413371.
6. Building machines that learn and think with people.
Nat Hum Behav. 2024 Oct;8(10):1851-1863. doi: 10.1038/s41562-024-01991-9. Epub 2024 Oct 22.

References

1. Peano: learning formal mathematical reasoning.
Philos Trans A Math Phys Eng Sci. 2023 Jul 24;381(2251):20220044. doi: 10.1098/rsta.2022.0044. Epub 2023 Jun 5.
2. Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum.
JAMA Intern Med. 2023 Jun 1;183(6):589-596. doi: 10.1001/jamainternmed.2023.1838.
3. Rethink reporting of evaluation results in AI.
Science. 2023 Apr 14;380(6641):136-138. doi: 10.1126/science.adf6369. Epub 2023 Apr 13.
4. Large language models and the perils of their hallucinations.
Crit Care. 2023 Mar 21;27(1):120. doi: 10.1186/s13054-023-04393-x.
5. Acquisition of chess knowledge in AlphaZero.
Proc Natl Acad Sci U S A. 2022 Nov 22;119(47):e2206625119. doi: 10.1073/pnas.2206625119. Epub 2022 Nov 14.
6. How transparency modulates trust in artificial intelligence.
Patterns (N Y). 2022 Feb 24;3(4):100455. doi: 10.1016/j.patter.2022.100455. eCollection 2022 Apr 8.
7. Advancing mathematics by guiding human intuition with AI.
Nature. 2021 Dec;600(7887):70-74. doi: 10.1038/s41586-021-04086-x. Epub 2021 Dec 1.
8. In Pursuit of Error: A Survey of Uncertainty Visualization Evaluation.
IEEE Trans Vis Comput Graph. 2018 Sep 10. doi: 10.1109/TVCG.2018.2864889.
9. Use and Misuse of the Likert Item Responses and Other Ordinal Measures.
Int J Exerc Sci. 2015 Jul 1;8(3):297-302. doi: 10.70252/LANZ1453. eCollection 2015.