

Evaluating Large Language Models for Automated Reporting and Data Systems Categorization: Cross-Sectional Study.

Authors

Wu Qingxia, Wu Qingxia, Li Huali, Wang Yan, Bai Yan, Wu Yaping, Yu Xuan, Li Xiaodong, Dong Pei, Xue Jon, Shen Dinggang, Wang Meiyun

Affiliations

Department of Medical Imaging, Henan Provincial People's Hospital & People's Hospital of Zhengzhou University, Zhengzhou, China.

Research Intelligence Department, Beijing United Imaging Research Institute of Intelligent Imaging, Beijing, China.

Publication

JMIR Med Inform. 2024 Jul 17;12:e55799. doi: 10.2196/55799.

DOI: 10.2196/55799
PMID: 39018102
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11292156/
Abstract

BACKGROUND

Large language models show promise for improving radiology workflows, but their performance on structured radiological tasks such as Reporting and Data Systems (RADS) categorization remains unexplored.

OBJECTIVE

This study aims to evaluate 3 large language model chatbots (Claude-2, GPT-3.5, and GPT-4) on assigning RADS categories to radiology reports and to assess the impact of different prompting strategies.

METHODS

This cross-sectional study compared the 3 chatbots on 30 radiology reports (10 per RADS criterion), using a 3-level prompting strategy: zero-shot, few-shot, and guideline PDF-informed prompts. The cases were grounded in Liver Imaging Reporting & Data System (LI-RADS) version 2018, Lung CT (computed tomography) Screening Reporting & Data System (Lung-RADS) version 2022, and Ovarian-Adnexal Reporting & Data System (O-RADS) magnetic resonance imaging, and were meticulously prepared by board-certified radiologists. Each report underwent 6 assessments. Two blinded reviewers assessed the chatbots' responses for patient-level RADS categorization and overall ratings. Agreement across repetitions was assessed using Fleiss κ.
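The 3-level prompting strategy can be sketched in code. The following is a hypothetical Python illustration only: the `build_prompt` helper, the chat-message format, and the prompt wording are assumptions, not the authors' actual prompts.

```python
# Hypothetical sketch of a 3-level prompting strategy:
# level 0 = zero-shot, level 1 = few-shot exemplar (prompt-1),
# level 2 = few-shot + guideline PDF text (prompt-2).
# Helper name, message format, and wording are illustrative only.
from typing import Optional


def build_prompt(report: str, level: int,
                 exemplar: Optional[str] = None,
                 guideline_text: Optional[str] = None) -> list:
    """Assemble a chat-style prompt at one of three escalating levels."""
    task = ("You are a radiologist. Assign the appropriate RADS category "
            "to the following radiology report.")
    messages = [{"role": "system", "content": task}]
    if level >= 1 and exemplar is not None:
        # prompt-1: add a structured worked example
        messages.append({"role": "user",
                         "content": "Worked example:\n" + exemplar})
    if level >= 2 and guideline_text is not None:
        # prompt-2: add text extracted from the official guideline PDF
        messages.append({"role": "user",
                         "content": "Guideline excerpt:\n" + guideline_text})
    messages.append({"role": "user", "content": report})
    return messages
```

Each report would then be submitted 6 times at each level, matching the study's repeated-assessment design.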

RESULTS

Claude-2 achieved the highest accuracy in overall ratings with few-shot prompts and guideline PDFs (prompt-2), attaining 57% (17/30) average accuracy over 6 runs and 50% (15/30) accuracy with k-pass voting. Without prompt engineering, all chatbots performed poorly. The introduction of a structured exemplar prompt (prompt-1) increased the accuracy of overall ratings for all chatbots. Providing prompt-2 further improved Claude-2's performance, an enhancement not replicated by GPT-4. The interrun agreement was substantial for Claude-2 (κ=0.66 for overall rating and κ=0.69 for RADS categorization), fair for GPT-4 (κ=0.39 for both), and fair for GPT-3.5 (κ=0.21 for overall rating and κ=0.39 for RADS categorization). All chatbots showed significantly higher accuracy with LI-RADS version 2018 than with Lung-RADS version 2022 and O-RADS (P<.05); with prompt-2, Claude-2 achieved the highest overall rating accuracy of 75% (45/60) in LI-RADS version 2018.
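The two quantitative analyses in these results, majority ("k-pass") voting over repeated runs and Fleiss κ for inter-run agreement, reduce to short standard routines. A minimal sketch follows; the function names are my own, not the paper's, and the implementation of Fleiss κ is the textbook formula.

```python
# Minimal sketches of two analyses reported in the results:
# majority ("k-pass") voting over repeated runs, and Fleiss kappa
# for inter-run agreement. Function names are illustrative.
from collections import Counter


def k_pass_vote(responses):
    """Majority vote over k repeated runs of the same report."""
    return Counter(responses).most_common(1)[0][0]


def accuracy(preds, truth):
    """Fraction of reports categorized correctly."""
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)


def fleiss_kappa(ratings):
    """Fleiss kappa; ratings[i] lists the labels the n runs gave report i."""
    n = len(ratings[0])                       # runs per report
    cats = sorted({c for row in ratings for c in row})
    counts = [[row.count(c) for c in cats] for row in ratings]
    N = len(ratings)
    # mean observed per-report agreement
    p_bar = sum((sum(x * x for x in row) - n) / (n * (n - 1))
                for row in counts) / N
    # chance agreement from marginal category proportions
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(len(cats))]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)


# Arithmetic check: the reported 57% average accuracy is 17/30 correct.
# round(17 / 30 * 100) == 57
```

With 6 runs per report, k-pass voting takes the most frequent of the 6 category labels as the final answer, which is why the voted accuracy (15/30) can differ from the per-run average (17/30).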

CONCLUSIONS

When equipped with structured prompts and guideline PDFs, Claude-2 demonstrated potential in assigning RADS categories to radiology cases according to established criteria such as LI-RADS version 2018. However, the current generation of chatbots lags in accurately categorizing cases based on more recent RADS criteria.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1b2e/11292156/f0b396909b37/medinform_v12i1e55799_fig1.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1b2e/11292156/5a981cb6e297/medinform_v12i1e55799_fig2.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1b2e/11292156/5a2a8611d8d7/medinform_v12i1e55799_fig3.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1b2e/11292156/5f99959f75a2/medinform_v12i1e55799_fig4.jpg

Similar Articles

1. Evaluating Large Language Models for Automated Reporting and Data Systems Categorization: Cross-Sectional Study. JMIR Med Inform. 2024 Jul 17;12:e55799. doi: 10.2196/55799.
2. Evaluating text and visual diagnostic capabilities of large language models on questions related to the Breast Imaging Reporting and Data System Atlas 5 edition. Diagn Interv Radiol. 2025 Mar 3;31(2):111-129. doi: 10.4274/dir.2024.242876. Epub 2024 Sep 9.
3. Using GPT-4 for LI-RADS feature extraction and categorization with multilingual free-text reports. Liver Int. 2024 Jul;44(7):1578-1587. doi: 10.1111/liv.15891. Epub 2024 Apr 23.
4. ChatGPT yields low accuracy in determining LI-RADS scores based on free-text and structured radiology reports in German language. Front Radiol. 2024 Jul 5;4:1390774. doi: 10.3389/fradi.2024.1390774. eCollection 2024.
5. Performance of AI chatbots on controversial topics in oral medicine, pathology, and radiology. Oral Surg Oral Med Oral Pathol Oral Radiol. 2024 May;137(5):508-514. doi: 10.1016/j.oooo.2024.01.015. Epub 2024 Feb 6.
6. Programming Chatbots Using Natural Language: Generating Cervical Spine MRI Impressions. Cureus. 2024 Sep 14;16(9):e69410. doi: 10.7759/cureus.69410. eCollection 2024 Sep.
7. An Empirical Evaluation of Prompting Strategies for Large Language Models in Zero-Shot Clinical Natural Language Processing: Algorithm Development and Validation Study. JMIR Med Inform. 2024 Apr 8;12:e55318. doi: 10.2196/55318.
8. Large Language Models for Therapy Recommendations Across 3 Clinical Specialties: Comparative Study. J Med Internet Res. 2023 Oct 30;25:e49324. doi: 10.2196/49324.
9. The performance of large language model-powered chatbots compared to oncology physicians on colorectal cancer queries. Int J Surg. 2024 Oct 1;110(10):6509-6517. doi: 10.1097/JS9.0000000000001850.
10. Large Language Models for Simplified Interventional Radiology Reports: A Comparative Analysis. Acad Radiol. 2025 Feb;32(2):888-898. doi: 10.1016/j.acra.2024.09.041. Epub 2024 Sep 30.

Cited By

1. Navigating Ovarian-Adnexal Reporting and Data System Magnetic Resonance Imaging (O-RADS MRI): A Review of Its Evolution, Current Advances, and Persistent Challenges in Ovarian Imaging. Cureus. 2025 Jun 25;17(6):e86717. doi: 10.7759/cureus.86717. eCollection 2025 Jun.
2. Evaluating the Accuracy of Privacy-Preserving Large Language Models in Calculating the Spinal Instability Neoplastic Score (SINS). Cancers (Basel). 2025 Jun 20;17(13):2073. doi: 10.3390/cancers17132073.
3. Evaluation of large language models in generating pulmonary nodule follow-up recommendations. Eur J Radiol Open. 2025 Apr 30;14:100655. doi: 10.1016/j.ejro.2025.100655. eCollection 2025 Jun.
4. Comparative performance of large language models in structuring head CT radiology reports: multi-institutional validation study in Japan. Jpn J Radiol. 2025 May 14. doi: 10.1007/s11604-025-01799-1.

References

1. Evaluation of GPT-4's Chest X-Ray Impression Generation: A Reader Study on Performance and Perception. J Med Internet Res. 2023 Dec 22;25:e50865. doi: 10.2196/50865.
2. Update: Lung-RADS 2022. Radiographics. 2023 Nov;43(11):e230037. doi: 10.1148/rg.230037.
3. Prompt Engineering as an Important Emerging Skill for Medical Professionals: Tutorial. J Med Internet Res. 2023 Oct 4;25:e50638. doi: 10.2196/50638.
4. O-RADS US v2022: An Update from the American College of Radiology's Ovarian-Adnexal Reporting and Data System US Committee. Radiology. 2023 Sep;308(3):e230685. doi: 10.1148/radiol.230685.
5. A Context-based Chatbot Surpasses Trained Radiologists and Generic ChatGPT in Following the ACR Appropriateness Guidelines. Radiology. 2023 Jul;308(1):e230970. doi: 10.1148/radiol.230970.
6. ChatGPT's Diagnostic Performance from Patient History and Imaging Findings on the Diagnosis Please Quizzes. Radiology. 2023 Jul;308(1):e231040. doi: 10.1148/radiol.231040.
7. Feasibility of Differential Diagnosis Based on Imaging Patterns Using a Large Language Model. Radiology. 2023 Jul;308(1):e231167. doi: 10.1148/radiol.231167.
8. Utility of ChatGPT in Clinical Practice. J Med Internet Res. 2023 Jun 28;25:e48568. doi: 10.2196/48568.
9. Evaluating GPT as an Adjunct for Radiologic Decision Making: GPT-4 Versus GPT-3.5 in a Breast Imaging Pilot. J Am Coll Radiol. 2023 Oct;20(10):990-997. doi: 10.1016/j.jacr.2023.05.003. Epub 2023 Jun 21.
10. Practical Tips for Reporting Adnexal Lesions Using O-RADS MRI. Radiographics. 2023 Jul;43(7):e220142. doi: 10.1148/rg.220142.