
Lung Cancer Staging Using Chest CT and FDG PET/CT Free-Text Reports: Comparison Among Three ChatGPT Large Language Models and Six Human Readers of Varying Experience.

Author Information

Lee Jong Eun, Park Ki-Seong, Kim Yun-Hyeon, Song Ho-Chun, Park Byunggeon, Jeong Yeon Joo

Affiliations

Department of Radiology and Research Institute of Radiology, Asan Medical Center, Seoul, Korea.

Department of Nuclear Medicine, Chonnam National University Hospital, Gwangju, Korea.

Publication Information

AJR Am J Roentgenol. 2024 Dec;223(6):e2431696. doi: 10.2214/AJR.24.31696. Epub 2024 Sep 4.

Abstract

Although radiology reports are commonly used for lung cancer staging, this task can be challenging given radiologists' variable reporting styles as well as reports' potentially ambiguous and/or incomplete staging-related information. The purpose of this study was to compare the performance of ChatGPT large language models (LLMs) and human readers of varying experience in lung cancer staging using chest CT and FDG PET/CT free-text reports. This retrospective study included 700 patients (mean age, 73.8 ± 29.5 [SD] years; 509 men, 191 women) from four institutions in Korea who underwent chest CT or FDG PET/CT for non-small cell lung cancer initial staging from January 2020 to December 2023. Reports were written in free-text format, in English only or in mixed English and Korean. Two thoracic radiologists, in consensus, determined the overall stage group (IA, IB, IIA, IIB, IIIA, IIIB, IIIC, IVA, or IVB) for each report using the 8th edition of the staging system, establishing the reference standard. Three ChatGPT models (GPT-4o, GPT-4, GPT-3.5) determined an overall stage group for each report using a script-based application programming interface, zero-shot learning, and a prompt incorporating a staging system summary. The code for the application was made publicly available through a GitHub repository (https://github.com/elmidion/GPT_Information_Extractor). Six human readers (two fellowship-trained radiologists with less experience than those who determined the reference standard, two fellows, and two residents) also independently determined overall stage groups. GPT-4o's overall accuracy for determining the correct stage among the nine groups was compared with that of the other LLMs and the human readers using McNemar tests.
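The script-based, zero-shot prompting setup described above can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' actual code or prompt: the staging summary text, function name, and sample report are placeholders, and the message structure assumes the standard chat-completions format. The authors' real implementation is in the linked GitHub repository.

```python
# Illustrative sketch of a zero-shot staging prompt (not the authors' prompt).
# STAGING_SUMMARY stands in for the "staging system summary" the study
# incorporated into the prompt; the real summary would be far more detailed.
STAGING_SUMMARY = (
    "You are assisting with lung cancer staging. Assign one overall stage "
    "group (IA, IB, IIA, IIB, IIIA, IIIB, IIIC, IVA, or IVB) per the 8th "
    "edition staging system, based on the T, N, and M findings in the report."
)

def build_messages(report_text: str) -> list[dict]:
    """Build a zero-shot prompt: staging summary plus one free-text report,
    with no worked examples included (hence 'zero-shot')."""
    return [
        {"role": "system", "content": STAGING_SUMMARY},
        {
            "role": "user",
            "content": f"Radiology report:\n{report_text}\n"
                       "Answer with the overall stage group only.",
        },
    ]

# Hypothetical report text for demonstration.
messages = build_messages(
    "Chest CT: 2.5-cm right upper lobe nodule; no nodal or distant disease."
)
```

In a script-based run, the `messages` list would be sent to the model's chat API for each of the 700 reports, and the returned stage group collected for scoring against the reference standard.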
GPT-4o had an overall staging accuracy of 74.1%, significantly better than the accuracy of GPT-4 (70.1%, p = .02), GPT-3.5 (57.4%, p < .001), and resident 2 (65.7%, p < .001); significantly worse than the accuracy of fellowship-trained radiologist 1 (82.3%, p < .001) and fellowship-trained radiologist 2 (85.4%, p < .001); and not significantly different from the accuracy of fellow 1 (77.7%, p = .09), fellow 2 (75.6%, p = .53), and resident 1 (72.3%, p = .42). Thus, the best-performing model, GPT-4o, showed no significant difference in staging accuracy versus fellows but performed significantly worse than fellowship-trained radiologists. The findings do not support the use of LLMs for lung cancer staging in place of expert health care professionals, and they indicate the importance of domain expertise for complex specialized tasks such as cancer staging.
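The pairwise accuracy comparisons above rely on McNemar tests, which compare two readers on the same cases using only the discordant pairs (cases one reader staged correctly and the other did not). A minimal self-contained sketch of the continuity-corrected chi-square version, using made-up discordant counts rather than the study's data:

```python
import math

def mcnemar_test(b: int, c: int) -> tuple[float, float]:
    """Continuity-corrected McNemar test for paired proportions.

    b: cases reader A staged correctly but reader B did not.
    c: cases reader B staged correctly but reader A did not.
    Returns (chi-square statistic, two-sided p-value with 1 df).
    """
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs: readers are indistinguishable
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # For a chi-square with 1 df, P(X > x) = erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

# Hypothetical discordant counts, for illustration only.
stat, p = mcnemar_test(15, 5)
```

Note that the published study reports exact p-values (e.g., p = .02 for GPT-4o vs GPT-4); small discordant counts would typically call for the exact binomial form of the test rather than the chi-square approximation sketched here.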

