Park Jun Sung, Hwang Jisun, Kim Pyeong Hwa, Shim Woo Hyun, Seo Min Jeong, Kim Dahyun, Shin Jeong In, Kim In Hwa, Heo Hwon, Suh Chong Hyun
Department of Pediatrics, University of Ulsan College of Medicine, Asan Medical Center, Seoul, Republic of Korea.
Department of Radiology, Seoul National University Bundang Hospital, Seoul National University College of Medicine, Seongnam, Republic of Korea.
Korean J Radiol. 2025 Sep;26(9):855-866. doi: 10.3348/kjr.2025.0240.
To evaluate the accuracy of multimodal large language models (LLMs) in detecting cases requiring immediate radiology reporting in pediatric radiology.
Seventy-one publicly available, paraphrased pediatric clinical vignettes with images, sourced from the […], […], […], and […], were assessed by seven vision-capable LLMs (temperature levels 0 and 1; t0 and t1) and four human readers (an expert pediatric radiologist, a trainee radiologist, an expert pediatrician, and a trainee pediatrician). Cases were classified as requiring immediate reporting (n = 33) if they corresponded to Korean Triage and Acuity Scale (KTAS) levels 1-2 (n = 24) or met the criteria for a critical value report (CVR) (n = 11). The most accurate LLM was compared with each human reader, with significance set at P < 0.013.
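The significance threshold of 0.013 is consistent with a Bonferroni-style correction across the four LLM-versus-reader comparisons. The abstract does not name the correction method, so the family-wise level of 0.05 and the rounding rule in this minimal sketch are assumptions.

```python
from decimal import Decimal, ROUND_HALF_UP

# Assumption: a family-wise alpha of 0.05, which the abstract does not state.
FAMILY_ALPHA = Decimal("0.05")
# From the abstract: the most accurate LLM is compared with each of four human readers.
N_COMPARISONS = 4

# Bonferroni-style per-comparison threshold (assumed method).
per_test_alpha = FAMILY_ALPHA / N_COMPARISONS

# Round half up to three decimals, matching the reported 0.013.
reported_alpha = per_test_alpha.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)

print(per_test_alpha, reported_alpha)  # 0.0125 0.013
```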
LLMs demonstrated 60.6%-83.1% accuracy in detecting cases requiring immediate radiology reporting: 57.7%-81.7% and 53.5%-87.3% for KTAS levels 1-2 and CVR cases, respectively. Gemini-Flash with t1 showed the highest accuracy among the LLMs: 83.1% (95% confidence interval [CI]: 74.6%-91.5%), 81.7% (95% CI: 71.8%-90.1%), and 87.3% (95% CI: 78.9%-94.4%) for identifying cases requiring immediate reporting, KTAS level 1-2 cases, and CVR cases, respectively, despite its low sensitivity for CVR detection (3/11, 27.3%). Human readers demonstrated 62.0%-84.5% accuracy for immediate radiology reporting, 73.2%-84.5% for KTAS levels 1-2, and 39.4%-94.4% for CVR cases. The accuracy of Gemini-Flash t1 in identifying cases requiring immediate radiology reporting was comparable to that of the most accurate human reader (vs. expert pediatrician: 84.5% [95% CI: 76.1%-93.0%]; P < 0.99).
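The gap between high accuracy and low sensitivity for CVR detection follows from class imbalance: only 11 of the 71 cases were CVR cases. The sketch below illustrates this for Gemini-Flash t1. The true-positive and false-negative counts come from the reported 3/11; the false-positive and true-negative counts are not reported and are inferred here from the 87.3% accuracy (62 of 71 correct), so the specificity it prints is a derived figure, not a published one. The `Confusion` class is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Binary confusion-matrix counts (hypothetical helper for illustration)."""
    tp: int  # positives correctly flagged
    fn: int  # positives missed
    fp: int  # negatives incorrectly flagged
    tn: int  # negatives correctly left unflagged

    @property
    def total(self) -> int:
        return self.tp + self.fn + self.fp + self.tn

    def accuracy(self) -> float:
        return (self.tp + self.tn) / self.total

    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)


# tp and fn: reported in the abstract (3 of 11 CVR cases detected).
# fp and tn: inferred from 87.3% accuracy on 71 cases, not reported.
cvr = Confusion(tp=3, fn=8, fp=1, tn=59)

print(f"accuracy:    {cvr.accuracy():.1%}")     # 87.3%
print(f"sensitivity: {cvr.sensitivity():.1%}")  # 27.3%
print(f"specificity: {cvr.specificity():.1%}")  # 98.3%
```

Because 60 of the 71 cases are CVR-negative, a model that rarely flags anything still scores well on accuracy, which is why the sensitivity figure matters for worklist prioritization.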
Multimodal LLMs may achieve overall accuracy comparable to or exceeding that of human readers in identifying cases requiring immediate radiology reporting, supporting their potential use for pediatric radiology worklist prioritization. However, the models' sensitivity in detecting such cases was not reliable.