Evaluation of ChatGPT-4o in Breast Cancer Screening: Insights from the 5th Edition BI-RADS Atlas and ACR Guidelines.

Author information

Özer Bilgen Mehpare, Korkmaz Eda Nur

Affiliation

Department of Radiology, Sincan Training and Research Hospital, Ankara, Turkey.

Publication information

J Imaging Inform Med. 2025 Sep 12. doi: 10.1007/s10278-025-01663-8.

DOI:10.1007/s10278-025-01663-8

PMID:40940588

Abstract

This study evaluates the potential, reliability, and limitations of ChatGPT-4o in answering text-based questions, and its effectiveness in clinical decision support, based on the 5th edition of the BI-RADS Atlas and the ACR breast cancer screening guidelines. A total of 100 questions (50 multiple-choice and 50 true/false) prepared by two radiologists were submitted to ChatGPT-4o between November 5 and 19. Both radiologists rated the answers for accuracy and comprehensiveness on a Likert scale at baseline and again 14 days later. Group comparisons were performed with the Mann-Whitney U and Wilcoxon tests; response consistency was evaluated with Cohen's Kappa, and the difference in overall accuracy with a two-proportion z-test. The increase in overall accuracy from 86% to 95% was statistically significant according to the two-proportion z-test (p = .030). Comparisons between the two sessions revealed statistically significant increases in the accuracy (p = .013, r = .35, 95% CI [0.09, 0.61]) and comprehensiveness (p = .014, r = .35, 95% CI [0.09, 0.61]) ratings for true/false questions. In contrast, no significant difference was found in the accuracy (p = .180, r = .19, 95% CI [-0.09, 0.47]) or comprehensiveness (p = .180, r = .19, 95% CI [-0.09, 0.47]) ratings for multiple-choice questions. Group comparisons of the two question formats likewise showed no significant difference in accuracy (p = .661, r = -0.04, 95% CI [-0.23, 0.16]) or comprehensiveness (p = .708, r = -0.04, 95% CI [-0.23, 0.16]). The consistency of ChatGPT-4o responses was supported by Cohen's Kappa coefficients, all statistically significant (p < .001), with 95% confidence intervals ranging from -0.038 to 1.084. ChatGPT-4o demonstrated remarkable performance in answering multiple-choice and true/false questions, with overall accuracy improving from 86% in the first test to 95% after 14 days. ChatGPT-4o holds significant potential as a clinical decision support tool for healthcare professionals.
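The reported two-proportion z-test can be checked from the figures in the abstract alone. The sketch below is illustrative rather than the authors' analysis code. It assumes 86 and 95 correct answers out of 100 questions in the first and second sessions, and it assumes the common pooled-variance form of the test with a two-sided p-value; the abstract does not say which variant was used. The function name `two_proportion_z_test` is hypothetical.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value).

    Assumption: the pooled-variance form is used. The abstract does not
    specify which variant the authors applied.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis of equal accuracy.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Assumed counts: 86/100 correct at baseline, 95/100 correct after 14 days.
z, p = two_proportion_z_test(86, 100, 95, 100)
print(f"z = {z:.2f}, two-sided p = {p:.3f}")
```

Under these assumptions the result agrees with the reported p = .030. Because the same 100 questions were asked in both sessions, the two samples are paired, so this calculation reproduces the reported figure rather than endorsing the choice of test.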

