Kim Hokun, Kim Bohyun, Choi Moon Hyung, Choi Joon-Il, Oh Soon Nam, Rha Sung Eun
Department of Radiology, Seoul St. Mary's Hospital, College of Medicine, The Catholic University of Korea, Seoul, Republic of Korea.
Department of Radiology, Eunpyeong St. Mary's Hospital, College of Medicine, The Catholic University of Korea, Seoul, Republic of Korea.
Korean J Radiol. 2025 Jun;26(6):557-568. doi: 10.3348/kjr.2024.1228. Epub 2025 Apr 17.
To evaluate the feasibility of generative pre-trained transformer-4 (GPT-4) in generating structured reports (SRs) from mixed-language (English and Korean) narrative-style CT reports for pancreatic ductal adenocarcinoma (PDAC) and to assess its accuracy in categorizing PDAC resectability.
This retrospective study included consecutive free-text reports of pancreas-protocol CT for staging PDAC, from two institutions, written in English or Korean from January 2021 to December 2023. Both the GPT-4 Turbo and GPT-4o models were provided prompts along with the free-text reports via an application programming interface and tasked with generating SRs and categorizing tumor resectability according to the National Comprehensive Cancer Network guidelines version 2.2024. Prompts were optimized using the GPT-4 Turbo model and 50 reports from Institution B. The performances of the GPT-4 Turbo and GPT-4o models in the two tasks were evaluated using 115 reports from Institution A. Results were compared with a reference standard that was manually derived by an abdominal radiologist. Each report was consecutively processed three times, with the most frequent response selected as the final output. Error analysis was guided by the decision rationale provided by the models.
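The three-run selection step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `majority_response`, the example category strings, and the tie handling (returning `None` when no response is strictly most frequent) are all assumptions, since the abstract does not say how three different responses were resolved.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_response(responses: Sequence[str]) -> Optional[str]:
    """Return the strictly most frequent response, or None if there is no clear winner."""
    if not responses:
        return None
    ranked = Counter(responses).most_common()
    top_response, top_count = ranked[0]
    # Assumption: a tie for first place (e.g. three different answers) is left
    # unresolved here; the abstract does not state how such cases were handled.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None
    return top_response


# Hypothetical outputs from three consecutive runs on one report.
runs = ["borderline resectable", "resectable", "borderline resectable"]
final_output = majority_response(runs)
print(final_output)
```

Returning `None` on a tie makes unresolved cases explicit so they can be flagged for manual review instead of being silently assigned a category.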
Of the 115 narrative reports tested, 96 (83.5%) contained both English and Korean. For SR generation, GPT-4 Turbo and GPT-4o demonstrated comparable accuracies (92.3% [1592/1725] and 92.2% [1590/1725], respectively; P = 0.923). In the resectability categorization, GPT-4 Turbo showed higher accuracy than GPT-4o (81.7% [94/115] vs. 67.0% [77/115], respectively; P = 0.002). In the error analysis of GPT-4 Turbo, the SR generation error rate was 7.7% (133/1725 items), which was primarily attributed to inaccurate data extraction (54.1% [72/133]). The resectability categorization error rate was 18.3% (21/115), with the main cause being violation of the resectability criteria (61.9% [13/21]).
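The reported percentages can be checked directly from the counts given above. The short Python sketch below recomputes each one; the helper name `pct` and the dictionary labels are illustrative only. The denominator 1725 equals 115 × 15, which is consistent with 15 scored items per report, although that per-report figure is an inference from the arithmetic and is not stated in the abstract.

```python
def pct(numerator: int, denominator: int) -> float:
    """Percentage rounded to one decimal place, as reported in the abstract."""
    return round(100 * numerator / denominator, 1)


# (numerator, denominator, percentage stated in the abstract)
reported = {
    "mixed-language reports": (96, 115, 83.5),
    "SR accuracy, GPT-4 Turbo": (1592, 1725, 92.3),
    "SR accuracy, GPT-4o": (1590, 1725, 92.2),
    "resectability accuracy, GPT-4 Turbo": (94, 115, 81.7),
    "resectability accuracy, GPT-4o": (77, 115, 67.0),
    "SR error rate, GPT-4 Turbo": (133, 1725, 7.7),
    "errors from inaccurate data extraction": (72, 133, 54.1),
    "resectability error rate, GPT-4 Turbo": (21, 115, 18.3),
    "errors from violating resectability criteria": (13, 21, 61.9),
}

for label, (num, den, stated) in reported.items():
    assert pct(num, den) == stated, label

# Correct and incorrect counts for GPT-4 Turbo sum to the totals.
assert 1592 + 133 == 1725
assert 94 + 21 == 115
```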
Both GPT-4 Turbo and GPT-4o demonstrated acceptable accuracy in generating NCCN-based SRs on PDACs from mixed-language narrative reports. However, oversight by human radiologists is essential for determining resectability based on CT findings.