Urda-Cîmpean Andrada Elena, Leucuța Daniel-Corneliu, Drugan Cristina, Duțu Alina-Gabriela, Călinici Tudor, Drugan Tudor
Department of Medical Informatics and Biostatistics, Iuliu Hațieganu University of Medicine and Pharmacy, 400349 Cluj-Napoca, Romania.
Department of Medical Biochemistry, Iuliu Hațieganu University of Medicine and Pharmacy, 400349 Cluj-Napoca, Romania.
Diagnostics (Basel). 2025 Jun 29;15(13):1657. doi: 10.3390/diagnostics15131657.
In recent years, numerous artificial intelligence applications, especially generative large language models, have emerged in the medical field. This study conducted a structured comparative analysis of four leading generative large language models (LLMs): ChatGPT-4o (OpenAI), Grok-3 (xAI), Gemini-2.0 Flash (Google), and DeepSeek-V3 (DeepSeek), to evaluate their diagnostic performance in clinical case scenarios. We assessed medical knowledge recall and clinical reasoning capabilities through staged, progressively complex cases, with responses graded by expert raters on a 0-5 scale. All models performed better on knowledge-based questions than on reasoning tasks, highlighting ongoing limitations in contextual diagnostic synthesis. Overall, DeepSeek outperformed the other models, achieving significantly higher scores across all evaluation dimensions (p < 0.05), particularly in medical reasoning tasks. While these findings support the feasibility of using LLMs for medical training and decision support, the study emphasizes the need for improved interpretability, prompt optimization, and rigorous benchmarking to ensure clinical reliability. This structured, comparative approach contributes to ongoing efforts to establish standardized evaluation frameworks for integrating LLMs into diagnostic workflows.
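The abstract reports the significance threshold (p < 0.05) but does not name the statistical procedure behind the between-model comparison. As a minimal illustration of how four sets of 0-5 expert ratings might be compared, the sketch below uses a Kruskal-Wallis omnibus test followed by Bonferroni-corrected pairwise Mann-Whitney U tests, a common approach for ordinal scores; the model names come from the abstract, while the rating values and the choice of tests are illustrative assumptions, not the study's actual data or method.

```python
# Illustrative sketch: comparing expert ratings (0-5 scale) across four LLMs.
# The rating values below are made up for demonstration; the abstract does not
# publish raw data, and the tests here (Kruskal-Wallis + pairwise Mann-Whitney U
# with Bonferroni correction) are assumptions, not the study's stated procedure.
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

ratings = {
    "ChatGPT-4o":       [4, 3, 4, 5, 3, 4, 2, 4],
    "Grok-3":           [3, 3, 4, 4, 2, 3, 3, 4],
    "Gemini-2.0 Flash": [3, 2, 4, 3, 3, 4, 2, 3],
    "DeepSeek-V3":      [5, 4, 5, 4, 4, 5, 3, 5],
}

# Omnibus test: do the four rating distributions differ at all?
h_stat, p_omnibus = kruskal(*ratings.values())
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_omnibus:.4f}")

# Pairwise follow-up, Bonferroni-adjusted for the 6 comparisons.
pairs = list(combinations(ratings, 2))
for a, b in pairs:
    _, p = mannwhitneyu(ratings[a], ratings[b], alternative="two-sided")
    flag = "significant" if p * len(pairs) < 0.05 else "n.s."
    print(f"{a} vs {b}: p = {p:.4f} ({flag}, Bonferroni-adjusted)")
```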