Extracting Pulmonary Embolism Diagnoses From Radiology Impressions Using GPT-4o: Large Language Model Evaluation Study.

Authors

Mahyoub Mohammed, Dougherty Kacie, Shukla Ajit

Affiliations

Virtua Health, Marlton, NJ, United States.

School of Systems Science and Industrial Engineering, Binghamton University, Binghamton, NY, United States.

Publication information

JMIR Med Inform. 2025 Apr 9;13:e67706. doi: 10.2196/67706.

DOI:10.2196/67706

PMID:40203306

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12018862/

Abstract

BACKGROUND

Pulmonary embolism (PE) is a critical condition requiring rapid diagnosis to reduce mortality. Extracting PE diagnoses from radiology reports manually is time-consuming, highlighting the need for automated solutions. Advances in natural language processing, especially transformer models like GPT-4o, offer promising tools to improve diagnostic accuracy and workflow efficiency in clinical settings.

OBJECTIVE

This study aimed to develop an automatic extraction system using GPT-4o to extract PE diagnoses from radiology report impressions, enhancing clinical decision-making and workflow efficiency.

METHODS

In total, 2 approaches were developed and evaluated: a fine-tuned Clinical Longformer as a baseline model and a GPT-4o-based extractor. Clinical Longformer, an encoder-only model, was chosen for its robustness in text classification tasks, particularly on smaller scales. GPT-4o, a decoder-only instruction-following LLM, was selected for its advanced language understanding capabilities. The study aimed to evaluate GPT-4o's ability to perform text classification compared to the baseline Clinical Longformer. The Clinical Longformer was trained on a dataset of 1000 radiology report impressions and validated on a separate set of 200 samples, while the GPT-4o extractor was validated using the same 200-sample set. Postdeployment performance was further assessed on an additional 200 operational records to evaluate model efficacy in a real-world setting.
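To make the GPT-4o arm concrete, the following is a minimal sketch of how an instruction-following LLM can be used as a binary text classifier over impressions. It is not the authors' implementation: the prompt wording, the `classify_impression` function, and the `fake_llm` stand-in are all hypothetical, and the model call is abstracted behind a plain callable so that no particular vendor SDK is assumed.

```python
from typing import Callable

# Hypothetical prompt wording; the study's actual prompt is not given in the abstract.
PROMPT_TEMPLATE = (
    "You are reviewing the impression section of a radiology report.\n"
    "Answer with exactly one word, POSITIVE or NEGATIVE, indicating whether "
    "the impression reports a pulmonary embolism.\n\n"
    "Impression:\n{impression}\n\nAnswer:"
)


def classify_impression(impression: str, complete: Callable[[str], str]) -> int:
    """Return 1 if the model reports PE, 0 otherwise.

    `complete` is any function that sends a prompt to an instruction-following
    LLM and returns its text reply (an assumption made to keep this runnable).
    """
    reply = complete(PROMPT_TEMPLATE.format(impression=impression.strip()))
    label = reply.strip().upper()
    if label.startswith("POSITIVE"):
        return 1
    if label.startswith("NEGATIVE"):
        return 0
    # Refuse to guess when the reply is off-format.
    raise ValueError(f"Unexpected model reply: {reply!r}")


def fake_llm(prompt: str) -> str:
    """Keyword stand-in for a real model call, used only to exercise the code."""
    impression = prompt.split("Impression:\n", 1)[1].split("\n\nAnswer:", 1)[0]
    text = impression.lower()
    return "NEGATIVE" if "no " in text or "negative" in text else "POSITIVE"


if __name__ == "__main__":
    for text in (
        "Acute pulmonary embolism in the right lower lobe.",
        "No evidence of pulmonary embolism.",
    ):
        print(text, "->", classify_impression(text, fake_llm))
```

Constraining the reply to a single fixed token and raising on anything else keeps the output machine-checkable, which matters when the labels feed a downstream workflow. In practice `complete` would wrap a real model endpoint.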

RESULTS

GPT-4o outperformed the Clinical Longformer in 2 of the metrics, achieving a sensitivity of 1.0 (95% CI 1.0-1.0; Wilcoxon test, P<.001) and an F-score of 0.975 (95% CI 0.9495-0.9947; Wilcoxon test, P<.001) across the validation dataset. Postdeployment evaluations also showed strong performance of the deployed GPT-4o model with a sensitivity of 1.0 (95% CI 1.0-1.0), a specificity of 0.94 (95% CI 0.8913-0.9804), and an F-score of 0.97 (95% CI 0.9479-0.9908). This high level of accuracy supports a reduction in manual review, streamlining clinical workflows and improving diagnostic precision.
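For readers unfamiliar with the reported metrics, the sketch below computes sensitivity, specificity, and F-score from binary labels and attaches a percentile-bootstrap 95% CI. The abstract does not say how the intervals were derived, so the bootstrap here is an assumption, as are the function names. The labels are toy data, chosen only so that the point estimates land near the reported postdeployment figures; the study's actual class balance is not stated in the abstract.

```python
import random
from typing import Dict, Sequence, Tuple


def metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Sensitivity, specificity, and F1 for binary labels (1 = PE present)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else nan,
        "specificity": tn / (tn + fp) if tn + fp else nan,
        "f1": 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else nan,
    }


def bootstrap_ci(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    metric: str,
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI (assumed method; not confirmed by the abstract)."""
    rng = random.Random(seed)
    n = len(y_true)
    values = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        value = metrics([y_true[i] for i in idx], [y_pred[i] for i in idx])[metric]
        if value == value:  # skip NaN from degenerate resamples
            values.append(value)
    values.sort()
    lower = values[int((alpha / 2) * len(values))]
    upper = values[min(len(values) - 1, int((1 - alpha / 2) * len(values)))]
    return lower, upper


# Toy labels (not study data): 100 positives, 100 negatives, 6 false positives.
y_true = [1] * 100 + [0] * 100
y_pred = [1] * 100 + [1] * 6 + [0] * 94

if __name__ == "__main__":
    print(metrics(y_true, y_pred))
    for name in ("sensitivity", "specificity", "f1"):
        print(name, bootstrap_ci(y_true, y_pred, name))
```

When a model makes no false negatives, every resample yields a sensitivity of 1.0, so the bootstrap interval collapses to a single point. This is one way an interval such as "1.0-1.0" can arise.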

CONCLUSIONS

The GPT-4o model provides an effective solution for the automatic extraction of PE diagnoses from radiology reports, offering a reliable tool that aids timely and accurate clinical decision-making. This approach has the potential to significantly improve patient outcomes by expediting diagnosis and treatment pathways for critical conditions like PE.
