Kim Songsoo, Kim Donghyun, Kim Jaewoong, Koo Jalim, Yoon Jinsik, Yoon Dukyong
Department of Biomedical Systems Informatics, Yonsei University College of Medicine, Seoul, Korea.
Department of Radiology, Central Draft Physical Examination Office of Military Manpower Administration, Daegu, Korea.
Healthc Inform Res. 2025 Jul;31(3):295-309. doi: 10.4258/hir.2025.31.3.295. Epub 2025 Jul 31.
This study assessed the effectiveness of in-context learning using Generative Pre-trained Transformer-4 (GPT-4) for labeling radiology reports.
In this retrospective study, radiology reports were obtained from the Medical Information Mart for Intensive Care III database. Two structured prompts, the "basic prompt" and the "in-context prompt," were compared. An optimization experiment was conducted to assess consistency and the occurrence of output format errors. The primary labeling experiments were performed on 200 unseen head computed tomography (CT) reports for multi-label classification of predefined labels (Experiment 1) and on 400 unseen abdominal CT reports for multi-label classification of actionable findings (Experiment 2).
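The difference between the two prompt types, and the kind of check needed to count output format errors, can be illustrated with a minimal sketch. The abstract does not give the study's actual prompt wording or output format, so everything here is an assumption made for illustration: the function names `build_prompt` and `parse_output`, the instruction text, and the choice of a JSON object as the reply format. No model is called; the sketch only builds the prompt string and validates a reply.

```python
import json


def build_prompt(report, labels, context=None):
    """Return a labeling prompt (hypothetical wording, not the study's).

    Without `context` this is the "basic" variant; passing `context`
    (labeling criteria or examples) gives the "in-context" variant.
    """
    parts = [
        "Label the radiology report below.",
        "Allowed labels: " + ", ".join(labels) + ".",
        "Answer with a JSON object that maps every label to 0 or 1.",
    ]
    if context:
        parts.append("Labeling criteria and examples:\n" + context)
    parts.append("Report:\n" + report)
    return "\n\n".join(parts)


def parse_output(raw, labels):
    """Parse a model reply; return None when the format is wrong."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # reply is not valid JSON
    if not isinstance(data, dict) or set(data) != set(labels):
        return None  # missing or unexpected labels
    if any(type(v) is not int or v not in (0, 1) for v in data.values()):
        return None  # values must be the integers 0 or 1
    return data
```

Counting the replies for which `parse_output` returns `None` gives one simple way to track output format errors, and sending the same prompt several times and comparing the parsed results gives one way to track consistency.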
The inter-reader accuracies in Experiments 1 and 2 were 0.93 and 0.84, respectively. For multi-label classification of head CT reports (Experiment 1), the in-context prompt led to notable increases in F1-scores for the "foreign body" and "mass" labels (gains of 0.66 and 0.22, respectively). However, improvements for other labels were modest. In multi-label classification of abdominal CT reports (Experiment 2), in-context prompts produced substantial improvements in F1-scores across all labels compared to basic prompts. Providing context equipped the model with domain-specific knowledge and helped align its existing knowledge, thereby improving performance.
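For readers unfamiliar with per-label evaluation, the sketch below shows how an F1-score per label and the gain between two prompts might be computed in a multi-label setting, using F1 = 2TP / (2TP + FP + FN). The function name `f1_per_label` and the four toy reports are assumptions made for illustration; the numbers they produce are not results from the study.

```python
def f1_per_label(gold, pred, labels):
    """Compute an F1-score for each label over paired label sets.

    `gold` and `pred` are lists of sets, one set of labels per report.
    A label with no true positives, false positives, or false negatives
    is scored 0.0 here (an arbitrary convention for the empty case).
    """
    scores = {}
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if label in g and label in p)
        fp = sum(1 for g, p in pairs if label not in g and label in p)
        fn = sum(1 for g, p in pairs if label in g and label not in p)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Toy data only: four imaginary reports, not taken from the study.
labels = ["foreign body", "mass"]
gold = [{"mass"}, {"mass"}, {"foreign body"}, set()]
basic_pred = [{"mass"}, set(), set(), set()]
incontext_pred = [{"mass"}, {"mass"}, {"foreign body"}, {"mass"}]

basic_f1 = f1_per_label(gold, basic_pred, labels)
incontext_f1 = f1_per_label(gold, incontext_pred, labels)

# Gain per label: in-context F1 minus basic F1.
gains = {label: incontext_f1[label] - basic_f1[label] for label in labels}
```

Reporting the gain per label rather than a single averaged score keeps rare labels visible, which matters when a few labels account for most of the improvement.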
In-context learning with GPT-4 consistently improved performance in labeling radiology reports. This approach is particularly effective for subjective labeling tasks and allows the model to align its criteria with those of human annotators for objective labeling. This practical strategy offers a simple, adaptable, and researcher-oriented method that can be applied to diverse labeling tasks.