Performance of Large Language Models on Medical Oncology Examination Questions.

Affiliations

Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada.

Department of Medicine, University of Toronto, Toronto, Ontario, Canada.

Publication Information

JAMA Netw Open. 2024 Jun 3;7(6):e2417641. doi: 10.1001/jamanetworkopen.2024.17641.


DOI: 10.1001/jamanetworkopen.2024.17641
PMID: 38888919
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11185976/
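
For readers who want to pull this record programmatically, the short sketch below resolves the PMID above through NCBI's public E-utilities ESummary endpoint. It is an illustrative sketch, not part of the original page; the JSON field names used ("title", "fulljournalname", "pubdate") are the keys ESummary typically returns for PubMed records, and no API key is required for occasional queries.

import json
from urllib.request import urlopen

pmid = "38888919"  # PMID listed above
url = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
       f"?db=pubmed&id={pmid}&retmode=json")

# ESummary returns {"result": {"uids": [...], "<pmid>": {...}}}
with urlopen(url) as resp:
    record = json.load(resp)["result"][pmid]

print(record["title"])                               # article title
print(record["fulljournalname"], record["pubdate"])  # journal and publication date
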
Abstract

IMPORTANCE: Large language models (LLMs) recently developed an unprecedented ability to answer questions. Studies of LLMs from other fields may not generalize to medical oncology, a high-stakes clinical setting requiring rapid integration of new information.

OBJECTIVE: To evaluate the accuracy and safety of LLM answers on medical oncology examination questions.

DESIGN, SETTING, AND PARTICIPANTS: This cross-sectional study was conducted between May 28 and October 11, 2023. The American Society of Clinical Oncology (ASCO) Oncology Self-Assessment Series on ASCO Connection, the European Society of Medical Oncology (ESMO) Examination Trial questions, and an original set of board-style medical oncology multiple-choice questions were presented to 8 LLMs.

MAIN OUTCOMES AND MEASURES: The primary outcome was the percentage of correct answers. Medical oncologists evaluated the explanations provided by the best LLM for accuracy, classified the types of errors, and estimated the likelihood and extent of potential clinical harm.

RESULTS: Proprietary LLM 2 correctly answered 125 of 147 questions (85.0%; 95% CI, 78.2%-90.4%; P < .001 vs random answering). Proprietary LLM 2 outperformed an earlier version, proprietary LLM 1, which correctly answered 89 of 147 questions (60.5%; 95% CI, 52.2%-68.5%; P < .001), and the best open-source LLM, Mixtral-8x7B-v0.1, which correctly answered 87 of 147 questions (59.2%; 95% CI, 50.0%-66.4%; P < .001). The explanations provided by proprietary LLM 2 contained no or minor errors for 138 of 147 questions (93.9%; 95% CI, 88.7%-97.2%). Incorrect responses were most commonly associated with errors in information retrieval, particularly with recent publications, followed by erroneous reasoning and reading comprehension. If acted upon in clinical practice, 18 of 22 incorrect answers (81.8%; 95% CI, 59.7%-94.8%) would have a medium or high likelihood of moderate to severe harm.

CONCLUSIONS AND RELEVANCE: In this cross-sectional study of the performance of LLMs on medical oncology examination questions, the best LLM answered questions with remarkable performance, although errors raised safety concerns. These results demonstrated an opportunity to develop and evaluate LLMs to improve health care clinician experiences and patient care, considering the potential impact on capabilities and safety.

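As a quick arithmetic check on the RESULTS figures above, the sketch below reproduces proprietary LLM 2's headline numbers (125 of 147 correct) with an exact binomial confidence interval and a test against random answering. The abstract does not state the CI method or the chance level, so the Clopper-Pearson ("exact") interval and the 0.25 guessing probability (4-option multiple choice) used here are assumptions.

from scipy.stats import binomtest

correct, total = 125, 147
chance = 0.25  # assumed probability of a correct random guess (4 options)

test = binomtest(correct, total, p=chance, alternative="greater")
ci = test.proportion_ci(confidence_level=0.95, method="exact")  # Clopper-Pearson

print(f"accuracy: {correct / total:.1%}")           # 85.0%
print(f"95% CI:   {ci.low:.1%} to {ci.high:.1%}")   # close to the reported 78.2%-90.4%
print(f"P vs random answering: {test.pvalue:.1e}")  # far below .001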

Figures (PMC full text):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9da4/11185976/5168d99a79e4/jamanetwopen-e2417641-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9da4/11185976/070435ff8ffd/jamanetwopen-e2417641-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9da4/11185976/cd1f11e17d25/jamanetwopen-e2417641-g003.jpg

Similar Articles

[1]
Performance of Large Language Models on Medical Oncology Examination Questions.

JAMA Netw Open. 2024-6-3

[2]
Performance of Large Language Models on a Neurology Board-Style Examination.

JAMA Netw Open. 2023-12-1

[3]
Quality of Large Language Model Responses to Radiation Oncology Patient Care Questions.

JAMA Netw Open. 2024-4-1

[4]
Semantic Clinical Artificial Intelligence vs Native Large Language Model Performance on the USMLE.

JAMA Netw Open. 2025-4-1

[5]
Benchmarking LLM chatbots' oncological knowledge with the Turkish Society of Medical Oncology's annual board examination questions.

BMC Cancer. 2025-2-4

[6]
Comparison of Ophthalmologist and Large Language Model Chatbot Responses to Online Patient Eye Care Questions.

JAMA Netw Open. 2023-8-1

[7]
Evaluating Artificial Intelligence Chatbots in Oral and Maxillofacial Surgery Board Exams: Performance and Potential.

J Oral Maxillofac Surg. 2025-3

[8]
Accuracy and quality of ChatGPT-4o and Google Gemini performance on image-based neurosurgery board questions.

Neurosurg Rev. 2025-3-25

[9]
Using Large Language Models to Automate Data Extraction From Surgical Pathology Reports: Retrospective Cohort Study.

JMIR Form Res. 2025-4-7

[10]
Unveiling GPT-4V's hidden challenges behind high accuracy on USMLE questions: Observational Study.

J Med Internet Res. 2025-2-7

Cited By

[1]
Assessing the adherence of large language models to clinical practice guidelines in Chinese medicine: a content analysis.

Front Pharmacol. 2025-7-25

[2]
The assessment of ChatGPT-4's performance compared to expert's consensus on chronic lateral ankle instability.

J Exp Orthop. 2025-8-5

[3]
Evaluating Large Language Models in Ptosis-Related Inquiries: A Cross-Lingual Study.

Transl Vis Sci Technol. 2025-7-1

[4]
Large-scale deep learning for metastasis detection in pathology reports.

JAMIA Open. 2025-7-11

[5]
Large language models in medical education: a comparative cross-platform evaluation in answering histological questions.

Med Educ Online. 2025-12

[6]
Deep Learning Model for Natural Language to Assess Effectiveness of Patients With Non-Muscle Invasive Bladder Cancer Receiving Intravesical Bacillus Calmette-Guérin Therapy.

JCO Clin Cancer Inform. 2025-6

[7]
Large language models in oncology: a review.

BMJ Oncol. 2025-5-15

[8]
Enhancing Pulmonary Disease Prediction Using Large Language Models With Feature Summarization and Hybrid Retrieval-Augmented Generation: Multicenter Methodological Study Based on Radiology Report.

J Med Internet Res. 2025-6-11

[9]
Large Language Models and Text Embeddings for Detecting Depression and Suicide in Patient Narratives.

JAMA Netw Open. 2025-5-1

[10]
Evaluation of Six Large Language Models for Clinical Decision Support: Application in Transfusion Decision-making for RhD Blood-type Patients.

Ann Lab Med. 2025-9-1

References

[1]
Assessment of ChatGPT-3.5's Knowledge in Oncology: Comparative Study with ASCO-SEP Benchmarks.

JMIR AI. 2024-1-12

[2]
Quality of Large Language Model Responses to Radiation Oncology Patient Care Questions.

JAMA Netw Open. 2024-4-1

[3]
Artificial Intelligence-Generated Draft Replies to Patient Inbox Messages.

JAMA Netw Open. 2024-3-4

[4]
To do no harm - and the most good - with AI in health care.

Nat Med. 2024-3

[5]
Almanac - Retrieval-Augmented Language Models for Clinical Medicine.

NEJM AI. 2024-2

[6]
Applications of large language models in cancer care: current evidence and future perspectives.

Front Oncol. 2023-9-4

[7]
Use of Artificial Intelligence Chatbots for Cancer Treatment Information.

JAMA Oncol. 2023-10-1

[8]
Creation and Adoption of Large Language Models in Medicine.

JAMA. 2023-9-5

[9]
Performance of a Large Language Model on Practice Questions for the Neonatal Board Examination.

JAMA Pediatr. 2023-9-1

[10]
Performance of an Upgraded Artificial Intelligence Chatbot for Ophthalmic Knowledge Assessment.

JAMA Ophthalmol. 2023-8-1
