How do large language models answer breast cancer quiz questions? A comparative study of GPT-3.5, GPT-4 and Google Gemini.
Affiliations
Breast Radiology Department, Fondazione IRCCS Istituto Nazionale dei Tumori, Via Giacomo Venezian 1, 20133, Milano, Italy.
Imaging Institute of Southern Switzerland (IIMSI), Ente Ospedaliero Cantonale (EOC), Lugano, Switzerland.
Publication information
Radiol Med. 2024 Oct;129(10):1463-1467. doi: 10.1007/s11547-024-01872-1. Epub 2024 Aug 13.
Applications of large language models (LLMs) in healthcare have shown promising results in processing and summarizing multidisciplinary information. This study evaluated the ability of three publicly available LLMs, namely GPT-3.5, GPT-4, and Google Gemini (then called Bard), to answer 60 multiple-choice questions (29 sourced from public databases, 31 newly formulated by experienced breast radiologists) about different aspects of breast cancer care: treatment and prognosis, diagnostic and interventional techniques, imaging interpretation, and pathology. Overall, the rate of correct answers differed significantly among the LLMs (p = 0.010): the best performance was achieved by GPT-4 (95%, 57/60), followed by GPT-3.5 (90%, 54/60) and Google Gemini (80%, 48/60). For each of the LLMs, no significant difference was observed between the rate of correct answers to questions sourced from public databases and the rate for newly formulated ones (p ≥ 0.593). These results highlight the potential benefits of LLMs in breast cancer care, although the models will need to be further refined through in-context training.
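As a quick arithmetic check, the reported percentages can be recomputed from the counts given in the abstract. The snippet below is only an illustrative sketch: the variable and function names are hypothetical, and it does not reproduce the reported p-values, because the abstract does not name the statistical test used and per-question results are not available here.

```python
# Counts taken from the abstract; all names below are illustrative assumptions.
TOTAL_QUESTIONS = 60
QUESTIONS_FROM_DATABASES = 29
QUESTIONS_NEWLY_FORMULATED = 31

# Number of correct answers reported for each model (out of 60).
correct_counts = {"GPT-4": 57, "GPT-3.5": 54, "Google Gemini": 48}


def accuracy_percent(correct: int, total: int) -> float:
    """Return the share of correct answers as a percentage."""
    return 100.0 * correct / total


# The two question sources should add up to the full question set.
sources_add_up = (
    QUESTIONS_FROM_DATABASES + QUESTIONS_NEWLY_FORMULATED == TOTAL_QUESTIONS
)

rates = {
    model: accuracy_percent(count, TOTAL_QUESTIONS)
    for model, count in correct_counts.items()
}

for model, rate in rates.items():
    print(f"{model}: {correct_counts[model]}/{TOTAL_QUESTIONS} = {rate:.0f}%")
```

The recomputed rates agree with the percentages stated in the abstract (95%, 90%, and 80%).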