Interpreting BI-RADS-Free Breast MRI Reports Using a Large Language Model: Automated BI-RADS Classification From Narrative Reports Using ChatGPT.

Authors

Tekcan Sanli Deniz Esin, Sanli Ahmet Necati, Ozmen Gizem, Ozmen Aycil, Cihan Irem, Kurt Atakan, Esmerer Emel

Affiliations

Department of Radiology, Faculty of Medicine, Gaziantep University, Gaziantep, Turkey (D.E.T.S., G.O., A.O., I.C., A.K.).

Department of General Surgery, Abdulkadir Yüksel State Hospital, Gaziantep, Turkey (A.N.S.).

Publication information

Acad Radiol. 2025 Sep 6. doi: 10.1016/j.acra.2025.08.026.

Abstract

PURPOSE

This study aimed to evaluate the performance of ChatGPT (GPT-4o) in interpreting free-text breast magnetic resonance imaging (MRI) reports by assigning BI-RADS categories and recommending appropriate clinical management steps in the absence of explicitly stated BI-RADS classifications.

METHODS

This retrospective, single-center study included 352 full-text breast MRI reports issued between January 2024 and June 2025, each documenting at least one identifiable breast lesion with descriptive imaging findings. Reports that were incomplete because of technical limitations, reports describing only normal findings, and MRI examinations performed at external institutions were excluded. The first aim was to assess ChatGPT's ability to infer the correct BI-RADS category (2, 3, 4a, 4b, 4c, or 5, each considered separately) based solely on the narrative imaging findings. The second aim was to evaluate the model's ability to distinguish benign from suspicious/malignant imaging features for the purpose of clinical decision-making. For this analysis, BI-RADS 2-3 categories were grouped as "benign" and BI-RADS 4-5 as "suspicious/malignant," in alignment with how BI-RADS categories are used to guide patient management rather than to represent definitive diagnostic outcomes. Reports originally containing the term "BI-RADS" were manually de-identified by removing BI-RADS categories and clinical recommendations. Each narrative report was then processed through ChatGPT using two standardized prompts: (1) What is the most appropriate BI-RADS category based on the findings in the report? (2) What should be the next clinical step (e.g., follow-up, biopsy)? Responses were evaluated in real time by two experienced breast radiologists, and their consensus was used as the reference standard.
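The grouping and prompting steps described above can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the names `binary_group` and `build_queries`, and the choice to prepend the report text to each prompt, are assumptions. Only the two prompt strings and the 2-3 versus 4-5 grouping come from the abstract, and no model API is called.

```python
# Hypothetical helpers for illustration; not taken from the study.

# Binary grouping as stated in the abstract: 2-3 vs 4-5.
BENIGN = {"2", "3"}
SUSPICIOUS = {"4a", "4b", "4c", "5"}

# The two standardized prompts quoted in the abstract.
PROMPTS = (
    "What is the most appropriate BI-RADS category based on the findings in the report?",
    "What should be the next clinical step (e.g., follow-up, biopsy)?",
)


def binary_group(category: str) -> str:
    """Map a BI-RADS category label to the binary management group."""
    key = category.strip().lower()
    if key in BENIGN:
        return "benign"
    if key in SUSPICIOUS:
        return "suspicious/malignant"
    raise ValueError(f"Category outside the study's scope: {category!r}")


def build_queries(report_text: str) -> list[str]:
    """Pair one narrative report with each standardized prompt.

    Assumption: the report text is placed before the prompt; the abstract
    does not say how the two were combined.
    """
    return [f"{report_text}\n\n{prompt}" for prompt in PROMPTS]


if __name__ == "__main__":
    print(binary_group("4B"))               # suspicious/malignant
    print(len(build_queries("(report)")))   # 2
```

Normalizing case in `binary_group` lets labels written as "4B" or "4b" resolve to the same group.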

RESULTS

ChatGPT demonstrated moderate agreement with the radiologists' consensus for BI-RADS classification (Cohen's kappa (κ): 0.510, p<0.001). Classification accuracy was highest for BI-RADS 5 reports (77.9%), whereas agreement was lower in intermediate categories such as BI-RADS 3 (52.4% correct) and 4b (29.4% correct). In the binary classification of reports as benign or suspicious/malignant, ChatGPT achieved almost perfect agreement (κ: 0.843), correctly identifying 91.7% of benign and 93.2% of malignant reports. Notably, the model's management recommendations were 100% consistent with its assigned BI-RADS categories: it advised biopsy for all BI-RADS 4-5 cases and short-interval follow-up or conditional biopsy for BI-RADS 3 reports.
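For readers unfamiliar with the agreement statistic reported above, Cohen's kappa compares observed agreement p_o with the agreement p_e expected by chance: κ = (p_o - p_e) / (1 - p_e). The sketch below is a minimal pure-Python version run on made-up toy labels. It does not use or reproduce the study's data, and the function name `cohen_kappa` is an assumption for illustration.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: sum over labels of the product of marginal frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        raise ValueError("Kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Toy labels only (hypothetical), not data from the study.
    model = ["benign", "benign", "malignant", "malignant"]
    reference = ["benign", "malignant", "malignant", "malignant"]
    print(cohen_kappa(model, reference))  # 0.5 for this toy example
```

In the toy example, p_o = 3/4 and p_e = (2·1 + 2·3)/16 = 0.5, so κ = 0.25/0.5 = 0.5.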

CONCLUSION

ChatGPT accurately interprets unstructured breast MRI reports, particularly in benign/malignant discrimination and corresponding clinical recommendations. This technology holds potential as a decision support tool to standardize reporting and enhance clinical workflows, especially in settings with variable reporting practices. Prospective, multi-institutional studies are needed for further validation.
