Accuracy of Large Language Models in Detecting Cases Requiring Immediate Reporting in Pediatric Radiology: A Feasibility Study Using Publicly Available Clinical Vignettes.

Authors

Park Jun Sung, Hwang Jisun, Kim Pyeong Hwa, Shim Woo Hyun, Seo Min Jeong, Kim Dahyun, Shin Jeong In, Kim In Hwa, Heo Hwon, Suh Chong Hyun

Affiliations

Department of Pediatrics, University of Ulsan College of Medicine, Asan Medical Center, Seoul, Republic of Korea.

Department of Radiology, Seoul National University Bundang Hospital, Seoul National University College of Medicine, Seongnam, Republic of Korea.

Publication information

Korean J Radiol. 2025 Sep;26(9):855-866. doi: 10.3348/kjr.2025.0240.

DOI:10.3348/kjr.2025.0240

PMID:40873376

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12394824/

Abstract

OBJECTIVE

To evaluate the accuracy of multimodal large language models (LLMs) in detecting cases requiring immediate radiology reporting in pediatric radiology.

MATERIALS AND METHODS

Seventy-one publicly available, paraphrased pediatric clinical vignettes with images, drawn from four sources, were assessed by seven vision-capable LLMs, each at temperature settings of 0 and 1 (t0 and t1), and by four human readers (an expert pediatric radiologist, a trainee radiologist, an expert pediatrician, and a trainee pediatrician). Cases were classified as requiring immediate reporting (n = 33) if they corresponded to Korean Triage and Acuity Scale (KTAS) levels 1-2 (n = 24) or met the criteria for a critical value report (CVR) (n = 11). The most accurate LLM was compared with each human reader, with significance set at P < 0.013.
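The case counts and the significance threshold above can be checked with a little arithmetic. The sketch below rests on two inferences that the abstract does not state directly: that the KTAS and CVR criteria overlap (24 + 11 exceeds 33), and that the 0.013 threshold is a Bonferroni correction of 0.05 across the four human-reader comparisons. Variable names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

# Counts reported in the abstract.
n_total = 71
n_immediate = 33   # cases requiring immediate reporting
n_ktas = 24        # KTAS level 1-2 cases
n_cvr = 11         # cases meeting critical value report criteria

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|.
n_both = n_ktas + n_cvr - n_immediate
n_not_immediate = n_total - n_immediate

# Assumption: 0.05 divided over four LLM-vs-reader comparisons (Bonferroni).
alpha = Decimal("0.05") / 4
alpha_rounded = alpha.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)

print(f"cases meeting both criteria: {n_both}")
print(f"cases not requiring immediate reporting: {n_not_immediate}")
print(f"per-comparison alpha: {alpha} (rounded: {alpha_rounded})")
```

Under these assumptions, two cases would satisfy both criteria and 38 cases would serve as negatives for the primary outcome.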

RESULTS

LLMs demonstrated 60.6%-83.1% accuracy in detecting cases requiring immediate radiology reporting, with 57.7%-81.7% accuracy for KTAS level 1-2 cases and 53.5%-87.3% for CVR cases. Gemini-Flash at t1 showed the highest accuracy among the LLMs: 83.1% (95% confidence interval [CI]: 74.6%-91.5%), 81.7% (95% CI: 71.8%-90.1%), and 87.3% (95% CI: 78.9%-94.4%) for identifying cases requiring immediate reporting, KTAS level 1-2 cases, and CVR cases, respectively, despite its low sensitivity for CVR detection (3/11, 27.3%). Human readers demonstrated 62.0%-84.5% accuracy for immediate radiology reporting, 73.2%-84.5% for KTAS level 1-2 cases, and 39.4%-94.4% for CVR cases. The accuracy of Gemini-Flash t1 in identifying cases requiring immediate radiology reporting was comparable to that of the most accurate human reader (vs. expert pediatrician: 84.5% [95% CI: 76.1%-93.0%]; P < 0.99).
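Each percentage above corresponds to a whole number of cases out of 71 (83.1% is 59/71, for example), and so do the reported interval endpoints, which is what a percentile bootstrap would produce. The abstract does not name the interval method, so the following is only a sketch under that assumption; the function name, resample count, and seed are hypothetical.

```python
import random

def bootstrap_accuracy_ci(n_correct, n_total, n_resamples=2000, seed=0):
    """Percentile bootstrap 95% CI for a proportion (illustrative helper)."""
    rng = random.Random(seed)
    # One entry per case: 1 = classified correctly, 0 = misclassified.
    outcomes = [1] * n_correct + [0] * (n_total - n_correct)
    stats = []
    for _ in range(n_resamples):
        # Resample cases with replacement and recompute accuracy.
        resample = rng.choices(outcomes, k=n_total)
        stats.append(sum(resample) / n_total)
    stats.sort()
    lower = stats[n_resamples * 25 // 1000]        # 2.5th percentile
    upper = stats[n_resamples * 975 // 1000 - 1]   # 97.5th percentile
    return lower, upper

# Gemini-Flash t1, immediate-reporting outcome: 59 of 71 correct (83.1%).
point = 59 / 71
lower, upper = bootstrap_accuracy_ci(59, 71)
print(f"{point:.1%} (95% CI: {lower:.1%}-{upper:.1%})")
```

Because every resample has 71 cases, the interval endpoints are always multiples of 1/71, matching the pattern in the reported intervals. Exact endpoints depend on the seed and the number of resamples.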

CONCLUSION

Multimodal LLMs may achieve overall accuracy comparable to or exceeding that of human readers in identifying cases requiring immediate radiology reporting, supporting their potential use for pediatric radiology worklist prioritization. However, the models' sensitivity in detecting such cases was not reliable.
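The contrast between high accuracy and unreliable sensitivity follows from class imbalance. As a rough illustration, the sketch below back-calculates a confusion matrix for Gemini-Flash t1 on CVR detection from the figures in the abstract (3 of 11 CVR cases detected, 87.3% accuracy). It assumes accuracy was computed over all 71 vignettes; the derived specificity and baseline are inferences, not reported results.

```python
# Figures taken from the abstract.
n_total = 71
n_positive = 11            # CVR cases
true_positive = 3          # CVR cases detected (3/11)
reported_accuracy = 0.873

# Back-calculated counts (assumes accuracy is over all 71 cases).
n_negative = n_total - n_positive
n_correct = round(reported_accuracy * n_total)
true_negative = n_correct - true_positive
false_positive = n_negative - true_negative
false_negative = n_positive - true_positive

sensitivity = true_positive / n_positive
specificity = true_negative / n_negative

# Baseline: a classifier that labels every case "not CVR".
all_negative_accuracy = n_negative / n_total

print(f"TP={true_positive} FN={false_negative} "
      f"TN={true_negative} FP={false_positive}")
print(f"sensitivity={sensitivity:.1%} specificity={specificity:.1%}")
print(f"accuracy if every case is called negative: {all_negative_accuracy:.1%}")
```

On these assumptions, most of the accuracy comes from correctly labeling the 60 non-CVR cases, and a classifier that never flags a case would still score close to the reported figure. This is why accuracy alone is a weak guide to suitability for worklist prioritization.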

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1ca5/12394824/f5a55445e9a3/kjr-26-855-g001.jpg
