Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada.
Department of Medicine, University of Toronto, Toronto, Ontario, Canada.
JAMA Netw Open. 2024 Jun 3;7(6):e2417641. doi: 10.1001/jamanetworkopen.2024.17641.
IMPORTANCE: Large language models (LLMs) recently developed an unprecedented ability to answer questions. Studies of LLMs from other fields may not generalize to medical oncology, a high-stakes clinical setting requiring rapid integration of new information.

OBJECTIVE: To evaluate the accuracy and safety of LLM answers on medical oncology examination questions.

DESIGN, SETTING, AND PARTICIPANTS: This cross-sectional study was conducted between May 28 and October 11, 2023. The American Society of Clinical Oncology (ASCO) Oncology Self-Assessment Series on ASCO Connection, the European Society of Medical Oncology (ESMO) Examination Trial questions, and an original set of board-style medical oncology multiple-choice questions were presented to 8 LLMs.

MAIN OUTCOMES AND MEASURES: The primary outcome was the percentage of correct answers. Medical oncologists evaluated the explanations provided by the best LLM for accuracy, classified the types of errors, and estimated the likelihood and extent of potential clinical harm.

RESULTS: Proprietary LLM 2 correctly answered 125 of 147 questions (85.0%; 95% CI, 78.2%-90.4%; P < .001 vs random answering). Proprietary LLM 2 outperformed an earlier version, proprietary LLM 1, which correctly answered 89 of 147 questions (60.5%; 95% CI, 52.2%-68.5%; P < .001), and the best open-source LLM, Mixtral-8x7B-v0.1, which correctly answered 87 of 147 questions (59.2%; 95% CI, 50.0%-66.4%; P < .001). The explanations provided by proprietary LLM 2 contained no or minor errors for 138 of 147 questions (93.9%; 95% CI, 88.7%-97.2%). Incorrect responses were most commonly associated with errors in information retrieval, particularly with recent publications, followed by erroneous reasoning and reading comprehension. If acted upon in clinical practice, 18 of 22 incorrect answers (81.8%; 95% CI, 59.7%-94.8%) would have a medium or high likelihood of moderate to severe harm.
CONCLUSIONS AND RELEVANCE: In this cross-sectional study of the performance of LLMs on medical oncology examination questions, the best LLM answered questions with remarkable performance, although errors raised safety concerns. These results demonstrated an opportunity to develop and evaluate LLMs to improve health care clinician experiences and patient care, considering the potential impact on capabilities and safety.
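The headline statistics above can be reproduced approximately from the raw counts alone. A minimal sketch, assuming a Wilson score interval for the 95% CI (the abstract does not state which interval method was used, so the bounds differ slightly from the reported 78.2%-90.4%) and an exact one-sided binomial test against chance with a hypothetical 5-option question format (20% chance accuracy):

```python
from math import comb, sqrt

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score 95% confidence interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom, (centre + margin) / denom

def binom_sf(k: int, n: int, p: float) -> float:
    """Exact one-sided binomial tail probability P(X >= k) given chance rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Proprietary LLM 2: 125 of 147 questions correct.
lo, hi = wilson_ci(125, 147)
# Chance rate of 0.2 is an assumption (5 answer options), not stated in the abstract.
p_value = binom_sf(125, 147, 0.2)
print(f"85.0% (95% CI {lo:.1%}-{hi:.1%}); P = {p_value:.2g} vs random answering")
```

The Wilson bounds come out near 78%-90%, consistent with the reported interval, and the tail probability is vanishingly small, matching the reported P < .001.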