


Evaluating language models for mathematics through interactions.

Affiliations

University of Cambridge, Cambridge CB2 1TN, United Kingdom.

University of Oxford, Oxford OX1 4BH, United Kingdom.

Publication Information

Proc Natl Acad Sci U S A. 2024 Jun 11;121(24):e2318124121. doi: 10.1073/pnas.2318124121. Epub 2024 Jun 3.

DOI: 10.1073/pnas.2318124121
PMID: 38830100
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11181017/
Abstract

There is much excitement about the opportunity to harness the power of large language models (LLMs) when building problem-solving assistants. However, the standard methodology of evaluating LLMs relies on static pairs of inputs and outputs; this is insufficient for making an informed decision about which LLMs are best to use in an interactive setting, and how that varies by setting. Static assessment therefore limits how we understand language model capabilities. We introduce CheckMate, an adaptable prototype platform for humans to interact with and evaluate LLMs. We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics, with a mixed cohort of participants from undergraduate students to professors of mathematics. We release the resulting interaction and rating dataset, MathConverse. By analyzing MathConverse, we derive a taxonomy of human query behaviors and uncover that despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness in LLM generations, among other findings. Further, we garner a more granular understanding of GPT-4 mathematical problem-solving through a series of case studies, contributed by experienced mathematicians. We conclude with actionable takeaways for ML practitioners and mathematicians: models that communicate uncertainty, respond well to user corrections, and can provide a concise rationale for their recommendations, may constitute better assistants. Humans should inspect LLM output carefully given their current shortcomings and potential for surprising fallibility.
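The interactive setup the abstract describes, in which participants query a model freely and rate each response separately for correctness and perceived helpfulness, can be sketched as a minimal evaluation loop. Everything below (names, fields, rating scales) is an illustrative assumption, not CheckMate's actual implementation:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical sketch of a CheckMate-style interaction record. The two
# rating axes mirror the abstract's distinction between correctness and
# perceived helpfulness; the integer scales are assumed, not the paper's.

@dataclass
class Turn:
    user_query: str
    model_response: str
    correctness: int   # participant's rating of mathematical correctness
    helpfulness: int   # participant's rating of perceived helpfulness

@dataclass
class Interaction:
    problem: str
    model_name: str
    turns: List[Turn] = field(default_factory=list)

def run_session(problem: str, model_name: str,
                ask_model: Callable[[str], str],
                get_query: Callable[[], str],
                rate: Callable[[str], Tuple[int, int]]) -> Interaction:
    """Drive one human-model session: the participant issues free-form
    queries, the model answers, and every answer is rated on both axes.
    An empty query ends the session."""
    session = Interaction(problem, model_name)
    while (query := get_query()):
        response = ask_model(query)
        correct, helpful = rate(response)
        session.turns.append(Turn(query, response, correct, helpful))
    return session
```

Collecting many such `Interaction` records across participants and models would yield a dataset in the spirit of MathConverse, with correctness and helpfulness ratings that can then be correlated or contrasted per model.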


Figures:
Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6721/11181017/0883ea5ef413/pnas.2318124121fig01.jpg
Fig. 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6721/11181017/281589f5e5e1/pnas.2318124121fig02.jpg
Fig. 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6721/11181017/3663921b1265/pnas.2318124121fig03.jpg

Similar Articles

1. Evaluating language models for mathematics through interactions.
Proc Natl Acad Sci U S A. 2024 Jun 11;121(24):e2318124121. doi: 10.1073/pnas.2318124121. Epub 2024 Jun 3.
2. Quality of Answers of Generative Large Language Models Versus Peer Users for Interpreting Laboratory Test Results for Lay Patients: Evaluation Study.
J Med Internet Res. 2024 Apr 17;26:e56655. doi: 10.2196/56655.
3. Learning to Make Rare and Complex Diagnoses With Generative AI Assistance: Qualitative Study of Popular Large Language Models.
JMIR Med Educ. 2024 Feb 13;10:e51391. doi: 10.2196/51391.
4. Quality of Answers of Generative Large Language Models vs Peer Patients for Interpreting Lab Test Results for Lay Patients: Evaluation Study.
ArXiv. 2024 Jan 23:arXiv:2402.01693v1.
5. Large Language Models for Therapy Recommendations Across 3 Clinical Specialties: Comparative Study.
J Med Internet Res. 2023 Oct 30;25:e49324. doi: 10.2196/49324.
6. Evaluation of the Performance of Generative AI Large Language Models ChatGPT, Google Bard, and Microsoft Bing Chat in Supporting Evidence-Based Dentistry: Comparative Mixed Methods Study.
J Med Internet Res. 2023 Dec 28;25:e51580. doi: 10.2196/51580.
7. Assessing the Alignment of Large Language Models With Human Values for Mental Health Integration: Cross-Sectional Study Using Schwartz's Theory of Basic Values.
JMIR Ment Health. 2024 Apr 9;11:e55988. doi: 10.2196/55988.
8. Use of Large Language Models to Predict Neuroimaging.
J Am Coll Radiol. 2023 Oct;20(10):1004-1009. doi: 10.1016/j.jacr.2023.06.008. Epub 2023 Jul 8.
9. The Role of Large Language Models in Transforming Emergency Medicine: Scoping Review.
JMIR Med Inform. 2024 May 10;12:e53787. doi: 10.2196/53787.
10. Current safeguards, risk mitigation, and transparency measures of large language models against the generation of health disinformation: repeated cross sectional analysis.
BMJ. 2024 Mar 20;384:e078538. doi: 10.1136/bmj-2023-078538.

Cited By

1. The dawn of a new era: can machine learning and large language models reshape QSP modeling?
J Pharmacokinet Pharmacodyn. 2025 Jun 16;52(4):36. doi: 10.1007/s10928-025-09984-5.
2. Comparing AI and human decision-making mechanisms in daily collaborative experiments.
iScience. 2025 May 21;28(6):112711. doi: 10.1016/j.isci.2025.112711. eCollection 2025 Jun 20.
3. Artificial intelligence in healthcare education: evaluating the accuracy of ChatGPT, Copilot, and Google Gemini in cardiovascular pharmacology.
Front Med (Lausanne). 2025 Feb 19;12:1495378. doi: 10.3389/fmed.2025.1495378. eCollection 2025.
4. Accessible interactive learning of mathematical expressions for school students with visual disabilities.
PeerJ Comput Sci. 2024 Dec 23;10:e2599. doi: 10.7717/peerj-cs.2599. eCollection 2024.
5. Should Artificial Intelligence Play a Durable Role in Biomedical Research and Practice?
Int J Mol Sci. 2024 Dec 13;25(24):13371. doi: 10.3390/ijms252413371.
6. Building machines that learn and think with people.
Nat Hum Behav. 2024 Oct;8(10):1851-1863. doi: 10.1038/s41562-024-01991-9. Epub 2024 Oct 22.

References

1. Peano: learning formal mathematical reasoning.
Philos Trans A Math Phys Eng Sci. 2023 Jul 24;381(2251):20220044. doi: 10.1098/rsta.2022.0044. Epub 2023 Jun 5.
2. Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum.
JAMA Intern Med. 2023 Jun 1;183(6):589-596. doi: 10.1001/jamainternmed.2023.1838.
3. Rethink reporting of evaluation results in AI.
Science. 2023 Apr 14;380(6641):136-138. doi: 10.1126/science.adf6369. Epub 2023 Apr 13.
4. Large language models and the perils of their hallucinations.
Crit Care. 2023 Mar 21;27(1):120. doi: 10.1186/s13054-023-04393-x.
5. Acquisition of chess knowledge in AlphaZero.
Proc Natl Acad Sci U S A. 2022 Nov 22;119(47):e2206625119. doi: 10.1073/pnas.2206625119. Epub 2022 Nov 14.
6. How transparency modulates trust in artificial intelligence.
Patterns (N Y). 2022 Feb 24;3(4):100455. doi: 10.1016/j.patter.2022.100455. eCollection 2022 Apr 8.
7. Advancing mathematics by guiding human intuition with AI.
Nature. 2021 Dec;600(7887):70-74. doi: 10.1038/s41586-021-04086-x. Epub 2021 Dec 1.
8. In Pursuit of Error: A Survey of Uncertainty Visualization Evaluation.
IEEE Trans Vis Comput Graph. 2018 Sep 10. doi: 10.1109/TVCG.2018.2864889.
9. Use and Misuse of the Likert Item Responses and Other Ordinal Measures.
Int J Exerc Sci. 2015 Jul 1;8(3):297-302. doi: 10.70252/LANZ1453. eCollection 2015.