
Suppr 超能文献



Comparative Evaluation of a Medical Large Language Model in Answering Real-World Radiation Oncology Questions: Multicenter Observational Study.

Authors

Dennstädt Fabio, Schmerder Max, Riggenbach Elena, Mose Lucas, Bryjova Katarina, Bachmann Nicolas, Mackeprang Paul-Henry, Ahmadsei Maiwand, Sinovcic Dubravko, Windisch Paul, Zwahlen Daniel, Rogers Susanne, Riesterer Oliver, Maffei Martin, Gkika Eleni, Haddad Hathal, Peeken Jan, Putora Paul Martin, Glatzer Markus, Putz Florian, Hoefler Daniel, Christ Sebastian M, Filchenko Irina, Hastings Janna, Gaio Roberto, Chiang Lawrence, Aebersold Daniel M, Cihoric Nikola

Affiliations

Inselspital, Department of Radiation Oncology, Bern University Hospital, University of Bern, Bern, Switzerland.

Department of Radiooncology and Radiotherapy, University Hospital Heidelberg, Heidelberg, Germany.

Publication

J Med Internet Res. 2025 Sep 23;27:e69752. doi: 10.2196/69752.

DOI:10.2196/69752
PMID:40986858
Abstract

BACKGROUND

Large language models (LLMs) hold promise for supporting clinical tasks, particularly in data-driven and technical disciplines such as radiation oncology. While prior studies have evaluated LLMs in examination-style settings, their performance in real-life clinical scenarios remains unclear. In the future, LLMs might serve as general AI assistants that answer questions arising in clinical practice, but it is unclear how well a modern LLM, executed locally within a hospital's infrastructure, would answer such questions compared with clinical experts.

OBJECTIVE

This study aimed to assess the performance of a locally deployed, state-of-the-art medical LLM in answering real-world clinical questions in radiation oncology compared with clinical experts. The aim was to evaluate the overall quality of answers, as well as the potential harmfulness of the answers if used for clinical decision-making.

METHODS

Physicians from 10 departments of European hospitals collected questions arising in the clinical practice of radiation oncology. Fifty of these questions were answered by 3 senior radiation oncology experts with at least 10 years of work experience, as well as the LLM Llama3-OpenBioLLM-70B (Ankit Pal and Malaikannan Sankarasubbu). In a blinded review, physicians rated the overall answer quality on a 5-point Likert scale (quality), assessed whether an answer might be potentially harmful if used for clinical decision-making (harmfulness), and determined if responses were from an expert or the LLM (recognizability). Comparisons between clinical experts and LLMs were then made for quality, harmfulness, and recognizability.

RESULTS

There was no significant difference in answer quality between the LLM and the clinical experts (mean scores of 3.38 vs 3.63; median 4.00, IQR 3.00-4.00 vs median 3.67, IQR 3.33-4.00; P=.26; Wilcoxon signed rank test). Answers were deemed potentially harmful in 13% of cases for the clinical experts compared with 16% of cases for the LLM (P=.63; Fisher exact test). Physicians correctly identified whether an answer came from a clinical expert or the LLM in 78% and 72% of cases, respectively.
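To illustrate the harmfulness comparison, the following sketch implements a two-sided Fisher exact test from scratch. The counts below are hypothetical placeholders (100 ratings per rater group, chosen only to mirror the reported 13% vs 16% proportions); they are not the study's data, and this is not the authors' analysis code.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    With all margins fixed, the top-left cell follows a hypergeometric
    distribution; the two-sided P value sums the probabilities of every
    table that is at least as extreme (i.e., no more probable) than the
    observed one.
    """
    n = a + b + c + d
    row1 = a + b          # first-row total
    col1 = a + c          # first-column total

    def p_table(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = p_table(a)
    lo = max(0, row1 - (n - col1))
    hi = min(row1, col1)
    # A small relative tolerance guards against floating-point ties.
    return sum(p for p in (p_table(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

# Illustrative 2x2 harmfulness table (hypothetical counts):
#                harmful  not harmful
#   experts        13         87
#   LLM            16         84
p = fisher_exact_two_sided(13, 87, 16, 84)
print(f"two-sided Fisher exact P = {p:.2f}")
```

In practice, `scipy.stats.fisher_exact([[13, 87], [16, 84]])` computes the same test; the hand-rolled version is shown only to make the "sum of tables no more probable than the observed one" definition explicit.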

CONCLUSIONS

A state-of-the-art medical LLM can answer real-life questions from the clinical practice of radiation oncology about as well as clinical experts in terms of overall quality and potential harmfulness. Such LLMs can already be deployed within a local hospital environment at affordable cost. While LLMs may not yet be ready for clinical implementation as general AI assistants, the technology continues to improve at a rapid pace. Evaluation studies based on real-life situations are important for understanding the weaknesses and limitations of LLMs in clinical practice, and they are crucial for defining when the technology is ready for clinical implementation. Furthermore, education on generative AI for health care professionals is needed to ensure responsible clinical implementation of this transformative technology.
