
Can large language models reason about medical questions?

Authors

Liévin Valentin, Hother Christoffer Egeberg, Motzfeldt Andreas Geert, Winther Ole

Affiliations

Section for Cognitive Systems, Technical University of Denmark, Anker Engelunds Vej 101, 2800 Kongens Lyngby, Denmark.

FindZebra, Rådvadsvej 36, 2400 Copenhagen, Denmark.

Publication

Patterns (N Y). 2024 Mar 1;5(3):100943. doi: 10.1016/j.patter.2024.100943. eCollection 2024 Mar 8.

DOI: 10.1016/j.patter.2024.100943
PMID: 38487804
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10935498/
Abstract

Although large language models often produce impressive outputs, it remains unclear how they perform in real-world scenarios requiring strong reasoning skills and expert domain knowledge. We set out to investigate whether closed- and open-source models (GPT-3.5, Llama 2, etc.) can be applied to answer and reason about difficult real-world-based questions. We focus on three popular medical benchmarks (MedQA-US Medical Licensing Examination [USMLE], MedMCQA, and PubMedQA) and multiple prompting scenarios: chain of thought (CoT; think step by step), few shot, and retrieval augmentation. Based on an expert annotation of the generated CoTs, we found that InstructGPT can often read, reason, and recall expert knowledge. Last, by leveraging advances in prompt engineering (few-shot and ensemble methods), we demonstrated that GPT-3.5 not only yields calibrated predictive distributions but also reaches the passing score on three datasets: MedQA-USMLE (60.2%), MedMCQA (62.7%), and PubMedQA (78.2%). Open-source models are closing the gap: Llama 2 70B also passed the MedQA-USMLE with 62.5% accuracy.

[Figures: the graphical abstract and Figures 1-9 are available in the PMC full text, PMC10935498.]

Similar Articles

[1]
Can large language models reason about medical questions?

Patterns (N Y). 2024-3-1

[2]
How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment.

JMIR Med Educ. 2023-2-8

[3]
OpenMedLM: prompt engineering can out-perform fine-tuning in medical question-answering with open-source large language models.

Sci Rep. 2024-6-19

[4]
Evaluating prompt engineering on GPT-3.5's performance in USMLE-style medical calculations and clinical scenarios generated by GPT-4.

Sci Rep. 2024-7-28

[5]
One LLM is not Enough: Harnessing the Power of Ensemble Learning for Medical Question Answering.

medRxiv. 2023-12-24

[6]
Reasoning with large language models for medical question answering.

J Am Med Inform Assoc. 2024-9-1

[7]
An Empirical Evaluation of Prompting Strategies for Large Language Models in Zero-Shot Clinical Natural Language Processing: Algorithm Development and Validation Study.

JMIR Med Inform. 2024-4-8

[8]
Large language models encode clinical knowledge.

Nature. 2023-8

[9]
Performance of ChatGPT on Ophthalmology-Related Questions Across Various Examination Levels: Observational Study.

JMIR Med Educ. 2024-1-18

[10]
A comparison of chain-of-thought reasoning strategies across datasets and models.

PeerJ Comput Sci. 2024-4-30

Cited By

[1]
Token Probabilities to Mitigate Large Language Models Overconfidence in Answering Medical Questions: Quantitative Study.

J Med Internet Res. 2025-8-29

[2]
Application prospect of large language model represented by ChatGPT in ophthalmology.

Int J Ophthalmol. 2025-9-18

[3]
Large Language Models for CAD-RADS 2.0 Extraction From Semi-Structured Coronary CT Angiography Reports: A Multi-Institutional Study.

Korean J Radiol. 2025-9

[4]
Exploring the use of large language models for classification, clinical interpretation, and treatment recommendation in breast tumor patient records.

Sci Rep. 2025-8-26

[5]
A scoping review of natural language processing in addressing medically inaccurate information: Errors, misinformation, and hallucination.

J Biomed Inform. 2025-7-22

[6]
EYE-Llama, an in-domain large language model for ophthalmology.

iScience. 2025-6-23

[7]
Evaluating multiple large language models on orbital diseases.

Front Cell Dev Biol. 2025-7-7

[8]
Comparing AI and human decision-making mechanisms in daily collaborative experiments.

iScience. 2025-5-21

[9]
Deep Learning in Digital Breast Tomosynthesis: Current Status, Challenges, and Future Trends.

MedComm (2020). 2025-6-9

[10]
Large Language Models in Medical Diagnostics: Scoping Review With Bibliometric Analysis.

J Med Internet Res. 2025-6-9

