A multi-dimensional performance evaluation of large language models in dental implantology: comparison of ChatGPT, DeepSeek, Grok, Gemini and Qwen across diverse clinical scenarios.

Authors

Wu Xing, Cai Guofei, Guo Bin, Ma Leizi, Shao Siqi, Yu Jun, Zheng Yuchen, Wang Linhong, Yang Fan

Affiliations

School of Stomatology, Zhejiang Chinese Medical University, Hangzhou, Zhejiang, China.

Center for Plastic and Reconstructive Surgery, Department of Stomatology, Affiliated People's Hospital, Zhejiang Provincial People's Hospital, Hangzhou Medical College, Hangzhou, Zhejiang, China.

Publication information

BMC Oral Health. 2025 Jul 28;25(1):1272. doi: 10.1186/s12903-025-06619-6.

Abstract

BACKGROUND

Large language models (LLMs) show promise in medicine, but their effectiveness in specialized fields such as implant dentistry remains unclear. This study systematically evaluates five recently released LLMs in clinical implantology scenarios and examines their respective strengths and weaknesses to guide their targeted application.

METHODS

A multi-dimensional evaluation was conducted using a test set of 40 professional questions (across 8 themes) and 5 complex cases. To ensure response uniformity, all queries were submitted to five LLMs (ChatGPT-o3-mini, DeepSeek-R1, Grok-3, Gemini-2.0-flash-Thinking, and Qwen2.5-max) using a pre-defined prompt. Parameters were standardized to ensure a fair comparison, and a single response was generated for each query without regeneration. Three experienced senior experts scored the responses of the five LLMs on five dimensions in two rounds of double-blind assessment. Inter-rater reliability was tested, followed by statistical analyses including Spearman's ρ test, the Friedman test, mixed-effects models, and principal component analysis (PCA).
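To make the Friedman step concrete, here is a minimal sketch of how scores from several models on the same items can be compared by within-item ranks. It is not the authors' code: the function names (`average_ranks`, `chi2_sf`, `friedman_test`) and the demo scores are hypothetical, and the abstract does not say which software or settings were used. The sketch assumes one total score per model per item.

```python
import math


def average_ranks(row):
    """Rank one row of scores (1 = lowest); tied values share their mean rank."""
    order = sorted(range(len(row)), key=lambda j: row[j])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j hold ranks i+1..j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    half = x / 2.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    total = math.erfc(math.sqrt(half))
    for i in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * half ** (i - 0.5) / math.gamma(i + 0.5)
    return total


def friedman_test(scores):
    """scores[i][j] = score of model j on item i. Returns (statistic, p_value)."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    tie_term = 0.0
    for row in scores:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
        for value in set(row):
            t = row.count(value)
            tie_term += t * (t * t - 1)
    stat = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    correction = 1.0 - tie_term / (n * k * (k * k - 1))
    if correction == 0:
        raise ValueError("all scores are tied within every item")
    stat /= correction  # standard adjustment for ties within an item
    return stat, chi2_sf(stat, k - 1)


if __name__ == "__main__":
    # Synthetic scores for illustration only (rows = items, columns = 5 models).
    demo = [
        [19, 20, 21, 22, 18],
        [18, 21, 20, 23, 17],
        [20, 19, 22, 21, 16],
        [19, 20, 20, 22, 18],
    ]
    statistic, p_value = friedman_test(demo)
    print(f"Friedman chi-square = {statistic:.3f}, p = {p_value:.4f}")
```

The test uses only the rank of each model within an item, so it suits ordinal expert scores that need not be normally distributed. A significant result would still need post-hoc pairwise comparisons to say which models differ.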

RESULTS

High inter-rater reliability was confirmed among the three experts (ICC for average measures ranged from 0.685 to 0.814, all P < 0.001). Gemini-2.0-flash-Thinking achieved the highest overall performance, with a mean score of 21.9 in professional question answering and 22.2 in case analysis. These scores were significantly higher than those of ChatGPT-o3-mini (mean score 19.2) in question responses and Qwen2.5-max (mean score 16.9) in case evaluations. Mixed-effects models showed the superiority of Gemini-2.0-flash-Thinking over ChatGPT-o3-mini, while Qwen2.5-max exhibited a decline in performance. DeepSeek-R1 and Qwen2.5-max also showed positive interaction effects in specific themes (such as Theme 3). The PCA results further indicated that Gemini-2.0-flash-Thinking had the best comprehensive ability in both types of tasks and revealed performance differences among the LLMs.
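For readers unfamiliar with an average-measures ICC, the sketch below computes it from a two-way ANOVA decomposition of a responses-by-raters score table. The abstract does not state which ICC form was used, so both common average-measures variants are shown as assumptions: consistency, often written ICC(3,k), and absolute agreement, often written ICC(2,k). The function names and example ratings are hypothetical.

```python
def _mean_squares(ratings):
    """Two-way ANOVA mean squares for ratings[i][j] (response i, rater j)."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return n, ms_rows, ms_cols, ms_error


def icc_average_consistency(ratings):
    """ICC(3,k): constant offsets between raters are not penalized."""
    _, ms_rows, _, ms_error = _mean_squares(ratings)
    return (ms_rows - ms_error) / ms_rows


def icc_average_agreement(ratings):
    """ICC(2,k): systematic differences between raters lower the value."""
    n, ms_rows, ms_cols, ms_error = _mean_squares(ratings)
    return (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)


if __name__ == "__main__":
    # Synthetic ratings for illustration only: 4 responses scored by 3 raters,
    # where each rater is exactly one point stricter than the next.
    demo = [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
    print(icc_average_consistency(demo), icc_average_agreement(demo))
```

In the demo the raters order the responses identically but differ by a fixed offset, so the consistency form is at its maximum while the agreement form is lower. This is why the form of ICC matters when interpreting a reported range.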

CONCLUSION

This study reveals that LLMs differ in their capabilities in dental implantology and recommends context-specific model selection for different clinical scenarios. Gemini-2.0-flash-Thinking demonstrated the best performance, notably for high-level clinical support.

TRIAL REGISTRATION

The study protocol and the use of clinical case data have been approved by the Medical Ethics Committee of Zhejiang Provincial People's Hospital (Approval No. QT2025050) on March 4th, 2025. Clinical trial number is not applicable.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f761/12302792/152ea7ea1cee/12903_2025_6619_Fig1_HTML.jpg
