Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada.
Department of Medicine, University of Toronto, Toronto, Ontario, Canada.
JAMA Netw Open. 2024 Jun 3;7(6):e2417641. doi: 10.1001/jamanetworkopen.2024.17641.
IMPORTANCE: Large language models (LLMs) recently developed an unprecedented ability to answer questions. Studies of LLMs from other fields may not generalize to medical oncology, a high-stakes clinical setting requiring rapid integration of new information.

OBJECTIVE: To evaluate the accuracy and safety of LLM answers on medical oncology examination questions.

DESIGN, SETTING, AND PARTICIPANTS: This cross-sectional study was conducted between May 28 and October 11, 2023. The American Society of Clinical Oncology (ASCO) Oncology Self-Assessment Series on ASCO Connection, the European Society of Medical Oncology (ESMO) Examination Trial questions, and an original set of board-style medical oncology multiple-choice questions were presented to 8 LLMs.

MAIN OUTCOMES AND MEASURES: The primary outcome was the percentage of correct answers. Medical oncologists evaluated the explanations provided by the best LLM for accuracy, classified the types of errors, and estimated the likelihood and extent of potential clinical harm.

RESULTS: Proprietary LLM 2 correctly answered 125 of 147 questions (85.0%; 95% CI, 78.2%-90.4%; P < .001 vs random answering). Proprietary LLM 2 outperformed an earlier version, proprietary LLM 1, which correctly answered 89 of 147 questions (60.5%; 95% CI, 52.2%-68.5%; P < .001), and the best open-source LLM, Mixtral-8x7B-v0.1, which correctly answered 87 of 147 questions (59.2%; 95% CI, 50.0%-66.4%; P < .001). The explanations provided by proprietary LLM 2 contained no or minor errors for 138 of 147 questions (93.9%; 95% CI, 88.7%-97.2%). Incorrect responses were most commonly associated with errors in information retrieval, particularly with recent publications, followed by erroneous reasoning and reading comprehension. If acted upon in clinical practice, 18 of 22 incorrect answers (81.8%; 95% CI, 59.7%-94.8%) would have a medium or high likelihood of moderate to severe harm.
CONCLUSIONS AND RELEVANCE: In this cross-sectional study of the performance of LLMs on medical oncology examination questions, the best LLM answered questions with remarkable performance, although errors raised safety concerns. These results demonstrated an opportunity to develop and evaluate LLMs to improve health care clinician experiences and patient care, considering the potential impact on capabilities and safety.
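The headline statistics above can be reproduced approximately from the raw counts alone. A minimal sketch, assuming a Wilson score interval for the 95% CI (the abstract does not state which interval method was used, so the bounds differ slightly from the reported 78.2%-90.4%) and an exact one-sided binomial test against chance with a hypothetical 5-option question format (20% chance accuracy):

```python
from math import comb, sqrt

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score 95% confidence interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom, (centre + margin) / denom

def binom_sf(k: int, n: int, p: float) -> float:
    """Exact one-sided binomial tail probability P(X >= k) given chance rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Proprietary LLM 2: 125 of 147 questions correct.
lo, hi = wilson_ci(125, 147)
# Chance rate of 0.2 is an assumption (5 answer options), not stated in the abstract.
p_value = binom_sf(125, 147, 0.2)
print(f"85.0% (95% CI {lo:.1%}-{hi:.1%}); P = {p_value:.2g} vs random answering")
```

The Wilson bounds come out near 78%-90%, consistent with the reported interval, and the tail probability is vanishingly small, matching the reported P < .001.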