Eminovic Semil, Levita Bogdan, Dell'Orco Andrea, Leppig Jonas Alexander, Nawabi Jawed, Penzkofer Tobias
Department of Radiology, Charité-Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, 13353 Berlin, Germany.
Department of Neuroradiology, Charité-Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, 13353 Berlin, Germany.
J Pers Med. 2025 Jun 5;15(6):235. doi: 10.3390/jpm15060235.
Background/Objectives: This study compares the accuracy of responses from state-of-the-art large language models (LLMs) to patient questions before CT and MRI imaging. We aim to demonstrate the potential of LLMs in improving workflow efficiency, while also highlighting risks such as misinformation. Methods: A total of 57 CT-related and 64 MRI-related patient questions were presented to ChatGPT-4o, Claude 3.5 Sonnet, Google Gemini, and Mistral Large 2. Each answer was evaluated by two board-certified radiologists and scored for accuracy/correctness/likelihood to mislead on a 5-point Likert scale. Statistical analyses compared LLM performance across question categories. Results: ChatGPT-4o achieved the highest average scores for CT-related questions and tied with Claude 3.5 Sonnet for MRI-related questions; all models scored higher for MRI than for CT (ChatGPT-4o: CT 4.52 (±0.46), MRI 4.79 (±0.37); Google Gemini: CT 4.44 (±0.58), MRI 4.68 (±0.58); Claude 3.5 Sonnet: CT 4.40 (±0.59), MRI 4.79 (±0.37); Mistral Large 2: CT 4.25 (±0.54), MRI 4.74 (±0.47)). At least one response per LLM was rated as inaccurate, with Google Gemini producing potentially misleading answers most often (5.26% for CT and 2.34% for MRI). Mistral Large 2 was outperformed by ChatGPT-4o across all CT-related questions (p < 0.001) and by ChatGPT-4o (p = 0.003), Google Gemini (p = 0.022), and Claude 3.5 Sonnet (p = 0.004) for CT contrast media information questions. Conclusions: Even though all LLMs performed well overall and showed great potential for patient education, each model occasionally produced potentially misleading information, highlighting the risk of clinical application.