Li Kathryn W, Lacson Ronilda, Guenette Jeffrey P, DiPiro Pamela J, Burk Kristine S, Kapoor Neena, Salah Fatima, Khorasani Ramin
Center for Evidence-Based Imaging, Department of Radiology, Brigham and Women's Hospital, Harvard Medical School, 1620 Tremont St, Boston, MA 02120.
AJR Am J Roentgenol. 2025 Apr;224(4):e2432341. doi: 10.2214/AJR.24.32341. Epub 2025 Jan 29.
Automated extraction of actionable details of recommendations for additional imaging (RAIs) from radiology reports could facilitate tracking and timely completion of clinically necessary RAIs and thereby potentially reduce diagnostic delays. The purpose of this study was to assess the performance of large language models (LLMs) in extracting actionable details of RAIs from radiology reports. This retrospective single-center study evaluated reports of diagnostic radiology examinations performed across modalities and care settings within five subspecialties (abdominal imaging, musculoskeletal imaging, neuroradiology, nuclear medicine, thoracic imaging) in August 2023. Of reports identified by a previously validated natural language processing algorithm as containing an RAI, 250 were randomly selected; 231 of these reports were confirmed to contain an RAI on manual review and formed the study sample. Twenty-five reports were used to engineer a prompt instructing an LLM, when given a report impression containing an RAI, to extract details about the modality, body part, time frame, and rationale of the RAI; the remaining 206 reports were used to test the prompt in combination with GPT-3.5 and GPT-4. A fourth-year medical student and a radiologist from the relevant subspecialty independently classified the LLM outputs as correct versus incorrect for extracting the four actionable details of RAIs in comparison with the report impressions; a third reviewer assisted in resolving discrepancies. Extraction accuracy was summarized and compared between LLMs using the consensus assessments. For GPT-3.5 and GPT-4, the two reviewers agreed on the classification of LLM output as correct versus incorrect for 95.6% and 94.2% of reports for RAI modality, 89.3% and 88.3% for RAI body part, 96.1% and 95.1% for RAI time frame, and 89.8% and 88.8% for RAI rationale, respectively.
GPT-4 was more accurate than GPT-3.5 in extracting RAI modality (94.2% [194/206] vs 85.4% [176/206], P < .001), RAI body part (86.9% [179/206] vs 77.2% [159/206], P = .004), and RAI time frame (99.0% [204/206] vs 95.6% [197/206], P = .02). Both LLMs had accuracy of 91.7% (189/206) for extracting RAI rationale. LLMs were used to extract actionable details of RAIs from free-text impression sections of radiology reports; GPT-4 outperformed GPT-3.5. The technique could represent an innovative method to facilitate timely completion of clinically necessary radiologist recommendations.
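The extraction workflow described above — prompting an LLM with a report impression and parsing out the RAI's modality, body part, time frame, and rationale — might be sketched as below. The prompt wording, JSON output schema, and the mocked model reply are illustrative assumptions only; the study's engineered prompt is not reproduced in this abstract, and no API call is made here.

```python
import json

# Hypothetical instruction modeled on the task the study describes; the
# authors' actual engineered prompt is not published in the abstract.
PROMPT_TEMPLATE = (
    "You are given the impression section of a radiology report that "
    "contains a recommendation for additional imaging (RAI). Extract the "
    "RAI's modality, body part, time frame, and rationale. Respond with "
    "JSON using exactly these keys: modality, body_part, time_frame, "
    "rationale. Use null for any detail not stated.\n\n"
    "Impression:\n{impression}"
)

# The four actionable details evaluated in the study.
RAI_FIELDS = ("modality", "body_part", "time_frame", "rationale")


def build_prompt(impression: str) -> str:
    """Fill the template with one report impression."""
    return PROMPT_TEMPLATE.format(impression=impression)


def parse_rai_details(llm_output: str) -> dict:
    """Parse the model's JSON reply into the four actionable details,
    tolerating missing keys so downstream tracking code always sees the
    same fixed schema."""
    data = json.loads(llm_output)
    return {field: data.get(field) for field in RAI_FIELDS}


# Mocked model reply standing in for a GPT-3.5/GPT-4 response:
mock_reply = json.dumps({
    "modality": "CT",
    "body_part": "chest",
    "time_frame": "3 months",
    "rationale": "indeterminate 6-mm pulmonary nodule",
})
details = parse_rai_details(mock_reply)
```

Structured (JSON) output is one common way to make such extractions machine-trackable; whether the study used free-text or structured LLM output is not stated in the abstract.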