Wu Xing, Cai Guofei, Guo Bin, Ma Leizi, Shao Siqi, Yu Jun, Zheng Yuchen, Wang Linhong, Yang Fan
School of Stomatology, Zhejiang Chinese Medical University, Hangzhou, Zhejiang, China.
Center for Plastic and Reconstructive Surgery, Department of Stomatology, Affiliated People's Hospital, Zhejiang Provincial People's Hospital, Hangzhou Medical College, Hangzhou, Zhejiang, China.
BMC Oral Health. 2025 Jul 28;25(1):1272. doi: 10.1186/s12903-025-06619-6.
Large language models (LLMs) show promise in medicine, but their effectiveness in specialized fields such as implant dentistry remains unclear. This study systematically evaluates the capabilities of five recently released LLMs in clinical implantology scenarios and examines their respective strengths and weaknesses to guide their precise application.
A multi-dimensional evaluation was conducted using a test set of 40 professional questions (across 8 themes) and 5 complex cases. To ensure uniform responses, all queries were submitted to five LLMs (ChatGPT-o3-mini, DeepSeek-R1, Grok-3, Gemini-2.0-flash-Thinking, and Qwen2.5-max) using a pre-defined prompt. Parameters were standardized to ensure a fair comparison, and a single response was generated for each query without regeneration. Three senior experts scored the responses of the five LLMs on five dimensions in two rounds of double-blind assessment. Inter-rater reliability was tested, followed by statistical analyses including Spearman's ρ test, the Friedman test, mixed-effects models, and principal component analysis (PCA).
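As an illustration of the Friedman test named above, the following is a minimal sketch of how the test statistic could be computed when each question is treated as a block and each model as a treatment. It is not the authors' analysis code: the function names `average_ranks` and `friedman_statistic` are hypothetical, and the score matrix in the usage line is synthetic, not study data.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values ascending (1 = smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(scores: Sequence[Sequence[float]]) -> float:
    """Tie-corrected Friedman chi-square statistic.

    scores[i][j] is the score of model j on question i (hypothetical layout).
    """
    n = len(scores)      # number of blocks (questions)
    k = len(scores[0])   # number of treatments (models)
    rank_sums = [0.0] * k
    tie_term = 0.0
    for row in scores:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
        # Accumulate t^3 - t for every group of t tied values in this block.
        counts = {}
        for v in row:
            counts[v] = counts.get(v, 0) + 1
        tie_term += sum(t ** 3 - t for t in counts.values())
    q = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    correction = 1.0 - tie_term / (n * k * (k * k - 1))
    if correction == 0:
        raise ValueError("all scores are tied within every block")
    return q / correction


# Synthetic example: 3 questions, 3 models, identical ordering in every block.
example_q = friedman_statistic([[1, 2, 3], [1, 2, 3], [1, 2, 3]])
```

Under the null hypothesis the statistic is approximately chi-square distributed with k − 1 degrees of freedom, so a p-value would be read from that distribution; the approximation is rough for small numbers of blocks.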
High inter-rater reliability was confirmed among the three experts (ICC for average measures ranged from 0.685 to 0.814, all P < 0.001). Gemini-2.0-flash-Thinking achieved the highest overall performance, with a mean score of 21.9 in professional question answering and 22.2 in case analysis. These scores were significantly higher than that of ChatGPT-o3-mini (mean score 19.2) in question answering and that of Qwen2.5-max (mean score 16.9) in case analysis. Mixed-effects models showed the superiority of Gemini-2.0-flash-Thinking over ChatGPT-o3-mini, while Qwen2.5-max exhibited a decline in performance. DeepSeek-R1 and Qwen2.5-max also showed positive interaction effects in specific themes (such as Theme 3). Principal component analysis (PCA) further indicated that Gemini-2.0-flash-Thinking had the best comprehensive ability in both types of tasks and revealed performance differences among the LLMs.
This study reveals that LLMs differ in their capabilities in dental implantology and recommends selecting models according to the clinical scenario. Gemini-2.0-flash-Thinking demonstrated the best performance, notably for high-level clinical support.
The study protocol and the use of clinical case data were approved by the Medical Ethics Committee of Zhejiang Provincial People's Hospital (Approval No. QT2025050) on March 4, 2025. Clinical trial number: not applicable.