Rydzewski Nicholas R, Dinakaran Deepak, Zhao Shuang G, Ruppin Eytan, Turkbey Baris, Citrin Deborah E, Patel Krishnan R
Radiation Oncology Branch, National Cancer Institute, National Institutes of Health, Bethesda, MD.
Physical Sciences Platform, Sunnybrook Research Institute, Toronto, ON, Canada.
NEJM AI. 2024 May;1(5). doi: 10.1056/aioa2300151. Epub 2024 Apr 16.
As artificial intelligence (AI) tools become widely accessible, more patients and medical professionals will turn to them for medical information. Large language models (LLMs), a subset of AI, excel in natural language processing tasks and hold considerable promise for clinical use. Fields such as oncology, in which clinical decisions are highly dependent on a continuous influx of new clinical trial data and evolving guidelines, stand to gain immensely from such advancements. It is therefore of critical importance to benchmark these models and describe their performance characteristics to guide their safe application to clinical oncology. Accordingly, the primary objectives of this work were to conduct comprehensive evaluations of LLMs in the field of oncology and to identify and characterize strategies that medical professionals can use to bolster their confidence in a model's response.
This study tested five publicly available LLMs (LLaMA 1, PaLM 2, Claude-v1, generative pretrained transformer 3.5 [GPT-3.5], and GPT-4) on a comprehensive battery of 2044 oncology questions, including topics from medical oncology, surgical oncology, radiation oncology, medical statistics, medical physics, and cancer biology. Model prompts were presented independently of each other, and each prompt was repeated three times to assess output consistency. For each response, models were instructed to provide a self-appraised confidence score (from 1 to 4). Model performance was also evaluated against a novel validation set comprising 50 oncology questions curated to eliminate any risk of overlap with the data used to train the LLMs.
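The following is a minimal, self-contained sketch of the repeated-prompting protocol described above, written for illustration only. The query_model and parse_reply helpers, the prompt wording, and the four-option format are assumptions introduced here; the study itself queried five separate provider APIs across 2044 questions.

```python
import re
from collections import Counter

N_REPEATS = 3  # each prompt was repeated three times to assess output consistency

def build_prompt(question: str, choices: list[str]) -> str:
    """Format a multiple-choice question and ask for an answer letter plus a
    self-appraised confidence score from 1 (low) to 4 (high)."""
    options = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", choices))
    return (
        f"{question}\n{options}\n"
        "Reply in the form 'Answer: <letter>, Confidence: <1-4>'."
    )

def query_model(model: str, prompt: str) -> str:
    # Placeholder for a real API call to the chosen model; returns a canned
    # reply here so the sketch runs standalone.
    return "Answer: A, Confidence: 3"

def parse_reply(reply: str) -> tuple[str, int]:
    """Extract the answer letter and confidence score from a model reply."""
    match = re.search(r"Answer:\s*([A-D]).*Confidence:\s*([1-4])", reply)
    return match.group(1), int(match.group(2))

def evaluate_question(model: str, question: str, choices: list[str]) -> dict:
    """Query one model N_REPEATS times on one question; summarize agreement
    and mean self-appraised confidence across the repetitions."""
    answers, confidences = [], []
    for _ in range(N_REPEATS):
        letter, score = parse_reply(query_model(model, build_prompt(question, choices)))
        answers.append(letter)
        confidences.append(score)
    majority_answer, votes = Counter(answers).most_common(1)[0]
    return {
        "majority_answer": majority_answer,
        "consistent": votes == N_REPEATS,                # all repetitions agree
        "mean_confidence": sum(confidences) / N_REPEATS,
    }
```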
There was significant heterogeneity in performance between models (analysis of variance, P<0.001). Relative to a human benchmark (2013 and 2014 examination results), GPT-4 was the only model to perform above the 50th percentile. Overall, model performance varied as a function of subject area across all models, with worse performance observed in clinical oncology subcategories compared with foundational topics (medical statistics, medical physics, and cancer biology). Within the clinical oncology subdomain, worse performance was observed in female-predominant malignancies. A combination of model selection, prompt repetition, and confidence self-appraisal allowed for the identification of high-performing subgroups of questions with observed accuracies of 81.7% and 81.1% in the Claude-v1 and GPT-4 models, respectively. Evaluation of the novel validation question set produced similar trends in model performance while also highlighting improved performance in newer, centrally hosted models (GPT-4 Turbo and Gemini 1.0 Ultra) and local models (Mixtral 8×7B and LLaMA 2).
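An illustrative sketch (not the authors' code) of the reported filtering idea: keep only questions on which all repetitions agree and the model's self-appraised confidence is maximal, then measure accuracy within that subgroup. Each entry in the hypothetical results list is assumed to look like the output of evaluate_question above, augmented with a boolean "correct" field introduced here for illustration.

```python
def high_confidence_subset(results: list[dict], min_confidence: float = 4.0) -> list[dict]:
    """Return the subgroup with fully consistent answers and the highest
    self-appraised confidence."""
    return [r for r in results if r["consistent"] and r["mean_confidence"] >= min_confidence]

def subgroup_accuracy(results: list[dict]) -> float:
    """Observed accuracy within the filtered subgroup; the paper reports roughly
    81-82% for Claude-v1 and GPT-4 under a comparable strategy."""
    subset = high_confidence_subset(results)
    return sum(r["correct"] for r in subset) / len(subset) if subset else float("nan")
```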
Of the models tested on a standardized set of oncology questions, GPT-4 was observed to have the highest performance. Although this performance is impressive, all LLMs continue to have clinically significant error rates, including examples of overconfidence and consistent inaccuracies. Given the enthusiasm to integrate these new implementations of AI into clinical practice, continued standardized evaluations of the strengths and limitations of these products will be critical to guide both patients and medical professionals. (Funded by the National Institutes of Health Clinical Center for Research and the Intramural Research Program of the National Institutes of Health; Z99 CA999999.).