Rydzewski Nicholas R, Dinakaran Deepak, Zhao Shuang G, Ruppin Eytan, Turkbey Baris, Citrin Deborah E, Patel Krishnan R
Radiation Oncology Branch, National Cancer Institute, National Institutes of Health, Bethesda, MD.
Physical Sciences Platform, Sunnybrook Research Institute, Toronto, ON, Canada.
NEJM AI. 2024 May;1(5). doi: 10.1056/aioa2300151. Epub 2024 Apr 16.
As artificial intelligence (AI) tools become widely accessible, more patients and medical professionals will turn to them for medical information. Large language models (LLMs), a subset of AI, excel in natural language processing tasks and hold considerable promise for clinical use. Fields such as oncology, in which clinical decisions are highly dependent on a continuous influx of new clinical trial data and evolving guidelines, stand to gain immensely from such advancements. It is therefore of critical importance to benchmark these models and describe their performance characteristics to guide their safe application to clinical oncology. Accordingly, the primary objectives of this work were to conduct comprehensive evaluations of LLMs in the field of oncology and to identify and characterize strategies that medical professionals can use to bolster their confidence in a model's response.
This study tested five publicly available LLMs (LLaMA 1, PaLM 2, Claude-v1, generative pretrained transformer 3.5 [GPT-3.5], and GPT-4) on a comprehensive battery of 2044 oncology questions, including topics from medical oncology, surgical oncology, radiation oncology, medical statistics, medical physics, and cancer biology. Model prompts were presented independently of each other, and each prompt was repeated three times to assess output consistency. For each response, models were instructed to provide a self-appraised confidence score (from 1 to 4). Model performance was also evaluated against a novel validation set comprising 50 oncology questions curated to eliminate any risk of overlap with the data used to train the LLMs.
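The following is a minimal, self-contained sketch of the repeated-prompting protocol described above, written for illustration only. The query_model and parse_reply helpers, the prompt wording, and the four-option format are assumptions introduced here; the study itself queried five separate provider APIs across 2044 questions.

```python
import re
from collections import Counter

N_REPEATS = 3  # each prompt was repeated three times to assess output consistency

def build_prompt(question: str, choices: list[str]) -> str:
    """Format a multiple-choice question and ask for an answer letter plus a
    self-appraised confidence score from 1 (low) to 4 (high)."""
    options = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", choices))
    return (
        f"{question}\n{options}\n"
        "Reply in the form 'Answer: <letter>, Confidence: <1-4>'."
    )

def query_model(model: str, prompt: str) -> str:
    # Placeholder for a real API call to the chosen model; returns a canned
    # reply here so the sketch runs standalone.
    return "Answer: A, Confidence: 3"

def parse_reply(reply: str) -> tuple[str, int]:
    """Extract the answer letter and confidence score from a model reply."""
    match = re.search(r"Answer:\s*([A-D]).*Confidence:\s*([1-4])", reply)
    return match.group(1), int(match.group(2))

def evaluate_question(model: str, question: str, choices: list[str]) -> dict:
    """Query one model N_REPEATS times on one question; summarize agreement
    and mean self-appraised confidence across the repetitions."""
    answers, confidences = [], []
    for _ in range(N_REPEATS):
        letter, score = parse_reply(query_model(model, build_prompt(question, choices)))
        answers.append(letter)
        confidences.append(score)
    majority_answer, votes = Counter(answers).most_common(1)[0]
    return {
        "majority_answer": majority_answer,
        "consistent": votes == N_REPEATS,                # all repetitions agree
        "mean_confidence": sum(confidences) / N_REPEATS,
    }
```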
There was significant heterogeneity in performance between models (analysis of variance, P<0.001). Relative to a human benchmark (2013 and 2014 examination results), GPT-4 was the only model to perform above the 50th percentile. Overall, model performance varied as a function of subject area across all models, with worse performance observed in clinical oncology subcategories compared with foundational topics (medical statistics, medical physics, and cancer biology). Within the clinical oncology subdomain, worse performance was observed in female-predominant malignancies. A combination of model selection, prompt repetition, and confidence self-appraisal allowed for the identification of high-performing subgroups of questions with observed accuracies of 81.7% and 81.1% in the Claude-v1 and GPT-4 models, respectively. Evaluation of the novel validation question set produced similar trends in model performance while also highlighting improved performance in newer, centrally hosted models (GPT-4 Turbo and Gemini 1.0 Ultra) and local models (Mixtral 8×7B and LLaMA 2).
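An illustrative sketch (not the authors' code) of the reported filtering idea: keep only questions on which all repetitions agree and the model's self-appraised confidence is maximal, then measure accuracy within that subgroup. Each entry in the hypothetical results list is assumed to look like the output of evaluate_question above, augmented with a boolean "correct" field introduced here for illustration.

```python
def high_confidence_subset(results: list[dict], min_confidence: float = 4.0) -> list[dict]:
    """Return the subgroup with fully consistent answers and the highest
    self-appraised confidence."""
    return [r for r in results if r["consistent"] and r["mean_confidence"] >= min_confidence]

def subgroup_accuracy(results: list[dict]) -> float:
    """Observed accuracy within the filtered subgroup; the paper reports roughly
    81-82% for Claude-v1 and GPT-4 under a comparable strategy."""
    subset = high_confidence_subset(results)
    return sum(r["correct"] for r in subset) / len(subset) if subset else float("nan")
```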
Of the models tested on a standardized set of oncology questions, GPT-4 was observed to have the highest performance. Although this performance is impressive, all LLMs continue to have clinically significant error rates, including examples of overconfidence and consistent inaccuracies. Given the enthusiasm to integrate these new implementations of AI into clinical practice, continued standardized evaluations of the strengths and limitations of these products will be critical to guide both patients and medical professionals. (Funded by the National Institutes of Health Clinical Center for Research and the Intramural Research Program of the National Institutes of Health; Z99 CA999999.).