How do large language models answer breast cancer quiz questions? A comparative study of GPT-3.5, GPT-4 and Google Gemini.

Affiliations

Breast Radiology Department, Fondazione IRCCS Istituto Nazionale dei Tumori, Via Giacomo Venezian 1, 20133, Milano, Italy.

Imaging Institute of Southern Switzerland (IIMSI), Ente Ospedaliero Cantonale (EOC), Lugano, Switzerland.

Publication information

Radiol Med. 2024 Oct;129(10):1463-1467. doi: 10.1007/s11547-024-01872-1. Epub 2024 Aug 13.

DOI:10.1007/s11547-024-01872-1

PMID:39138732

Abstract

Applications of large language models (LLMs) in the healthcare field have shown promising results in processing and summarizing multidisciplinary information. This study evaluated the ability of three publicly available LLMs (GPT-3.5, GPT-4, and Google Gemini, then called Bard) to answer 60 multiple-choice questions (29 sourced from public databases, 31 newly formulated by experienced breast radiologists) about different aspects of breast cancer care: treatment and prognosis, diagnostic and interventional techniques, imaging interpretation, and pathology. Overall, the rate of correct answers significantly differed among LLMs (p = 0.010): the best performance was achieved by GPT-4 (95%, 57/60) followed by GPT-3.5 (90%, 54/60) and Google Gemini (80%, 48/60). Across all LLMs, no significant differences were observed in the rates of correct replies to questions sourced from public databases and newly formulated ones (p ≥ 0.593). These results highlight the potential benefits of LLMs in breast cancer care, which will need to be further refined through in-context training.
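As a quick arithmetic check, the reported percentages can be recomputed from the counts given in the abstract. The sketch below is only illustrative: the helper name `percent_correct` and the dictionary layout are assumptions made for this example, not part of the study. It does not attempt to reproduce the reported p-values, because the abstract does not name the statistical test and the per-question results are not available here.

```python
# Counts of correct answers out of 60 questions, as reported in the abstract.
TOTAL_QUESTIONS = 60
correct_answers = {
    "GPT-4": 57,
    "GPT-3.5": 54,
    "Google Gemini": 48,
}

# Question sources reported in the abstract.
public_database_questions = 29
newly_formulated_questions = 31


def percent_correct(correct: int, total: int) -> int:
    """Return the share of correct answers as a whole-number percentage.

    Hypothetical helper written for this illustration only.
    """
    return round(100 * correct / total)


# The two question sources should add up to the full question set.
assert public_database_questions + newly_formulated_questions == TOTAL_QUESTIONS

for model, correct in correct_answers.items():
    rate = percent_correct(correct, TOTAL_QUESTIONS)
    print(f"{model}: {correct}/{TOTAL_QUESTIONS} = {rate}%")
```

The recomputed values agree with the 95%, 90% and 80% stated in the abstract.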

Similar articles

Evaluating Large Language Models for the National Premedical Exam in India: Comparative Analysis of GPT-3.5, GPT-4, and Bard.

JMIR Med Educ. 2024 Feb 21;10:e51523. doi: 10.2196/51523.

BI-RADS Category Assignments by GPT-3.5, GPT-4, and Google Bard: A Multilanguage Study.

Radiology. 2024 Apr;311(1):e232133. doi: 10.1148/radiol.232133.

Evaluating text and visual diagnostic capabilities of large language models on questions related to the Breast Imaging Reporting and Data System Atlas 5 edition.

Diagn Interv Radiol. 2025 Mar 3;31(2):111-129. doi: 10.4274/dir.2024.242876. Epub 2024 Sep 9.

Artificial Intelligence for Anesthesiology Board-Style Examination Questions: Role of Large Language Models.

J Cardiothorac Vasc Anesth. 2024 May;38(5):1251-1259. doi: 10.1053/j.jvca.2024.01.032. Epub 2024 Feb 1.

Performance of three artificial intelligence (AI)-based large language models in standardized testing; implications for AI-assisted dental education.

J Periodontal Res. 2025 Feb;60(2):121-133. doi: 10.1111/jre.13323. Epub 2024 Jul 18.

Hallucination Rates and Reference Accuracy of ChatGPT and Bard for Systematic Reviews: Comparative Analysis.

J Med Internet Res. 2024 May 22;26:e53164. doi: 10.2196/53164.

Performance of ChatGPT and Bard in self-assessment questions for nephrology board renewal.

Clin Exp Nephrol. 2024 May;28(5):465-469. doi: 10.1007/s10157-023-02451-w. Epub 2024 Feb 14.

Performance evaluation of ChatGPT, GPT-4, and Bard on the official board examination of the Japan Radiology Society.

Jpn J Radiol. 2024 Feb;42(2):201-207. doi: 10.1007/s11604-023-01491-2. Epub 2023 Oct 4.

The Performance of GPT-3.5, GPT-4, and Bard on the Japanese National Dentist Examination: A Comparison Study.

Cureus. 2023 Dec 12;15(12):e50369. doi: 10.7759/cureus.50369. eCollection 2023 Dec.

Cited by

Evaluating Accuracy and Readability of Responses to Midlife Health Questions: A Comparative Analysis of Six Large Language Model Chatbots.

J Midlife Health. 2025 Jan-Mar;16(1):45-50. doi: 10.4103/jmh.jmh_182_24. Epub 2025 Apr 5.

Accuracy of Large Language Models When Answering Clinical Research Questions: Systematic Review and Network Meta-Analysis.

J Med Internet Res. 2025 Apr 30;27:e64486. doi: 10.2196/64486.

Generative AI and large language models in nuclear medicine: current status and future prospects.

Ann Nucl Med. 2024 Nov;38(11):853-864. doi: 10.1007/s12149-024-01981-x. Epub 2024 Sep 25.
