A multi-dimensional performance evaluation of large language models in dental implantology: comparison of ChatGPT, DeepSeek, Grok, Gemini and Qwen across diverse clinical scenarios.

Authors

Wu Xing, Cai Guofei, Guo Bin, Ma Leizi, Shao Siqi, Yu Jun, Zheng Yuchen, Wang Linhong, Yang Fan

Affiliations

School of Stomatology, Zhejiang Chinese Medical University, Hangzhou, Zhejiang, China.

Center for Plastic and Reconstructive Surgery, Department of Stomatology, Affiliated People's Hospital, Zhejiang Provincial People's Hospital, Hangzhou Medical College, Hangzhou, Zhejiang, China.

Publication information

BMC Oral Health. 2025 Jul 28;25(1):1272. doi: 10.1186/s12903-025-06619-6.

DOI: 10.1186/s12903-025-06619-6
PMID: 40721763
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12302792/

Abstract

BACKGROUND

Large language models (LLMs) show promise in medicine, but their effectiveness in specialized fields such as implant dentistry remains unclear. This study focuses on five recently released LLMs, aiming to systematically evaluate their capabilities in clinical implantology scenarios and to examine their respective strengths and weaknesses in order to guide precise application.

METHODS

A comprehensive multi-dimensional evaluation was conducted using a test set of 40 professional questions (across 8 themes) and 5 complex cases. To ensure response uniformity, all queries were submitted to five LLMs (ChatGPT-o3-mini, DeepSeek-R1, Grok-3, Gemini-2.0-flash-Thinking, and Qwen2.5-max) using a pre-defined prompt. Parameters were standardized to ensure a fair comparison, and a single response was generated for each query without re-generation. The responses were scored on five dimensions by three senior experts in two double-blind rounds. Inter-rater reliability was tested, followed by statistical analyses including Spearman's ρ test, the Friedman test, mixed-effects models, and principal component analysis (PCA).
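As an illustration of one of the analyses named above, the Friedman test compares several models scored on the same set of questions by ranking the models within each question. The following is a minimal sketch, not the authors' analysis code: the `scores` table is hypothetical data invented for the example, the function names are placeholders, and no tie correction is applied (the sample data has no ties within a row).

```python
import math

def rank_with_ties(values):
    """Return average ranks (1 = smallest) for one row of scores."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of equal values so they share one average rank.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks

def friedman_statistic(blocks):
    """Friedman chi-square for n rows (questions) by k columns (models).

    Uses the basic formula without a tie correction.
    Returns (statistic, degrees_of_freedom).
    """
    n = len(blocks)
    k = len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:
        for col, rank in enumerate(rank_with_ties(row)):
            rank_sums[col] += rank
    stat = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return stat, k - 1

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; this closed form holds only for even df."""
    if df % 2 != 0:
        raise ValueError("closed form requires an even number of degrees of freedom")
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))

# Hypothetical total scores: one row per question, one column per model.
scores = [
    [19, 20, 18, 22, 17],
    [18, 21, 19, 23, 16],
    [20, 19, 18, 22, 17],
    [17, 20, 19, 21, 16],
    [19, 21, 18, 23, 17],
    [18, 20, 19, 22, 16],
]

stat, df = friedman_statistic(scores)
p_value = chi2_sf_even_df(stat, df)
print(f"Friedman chi-square = {stat:.3f}, df = {df}, p = {p_value:.5f}")
```

With five models the test has four degrees of freedom, which is why the even-df closed form for the chi-square tail suffices here; a statistics library would be the usual choice for real data, especially when ties occur.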

RESULTS

High inter-rater reliability was confirmed among the three experts (ICC for average measures ranged from 0.685 to 0.814, all P < 0.001). Gemini-2.0-flash-Thinking achieved the highest overall performance, with mean scores of 21.9 in professional question answering and 22.2 in case analysis. These were significantly higher than the score of ChatGPT-o3-mini (mean 19.2) in question answering and that of Qwen2.5-max (mean 16.9) in case analysis. Mixed-effects models showed that Gemini-2.0-flash-Thinking outperformed ChatGPT-o3-mini, while Qwen2.5-max exhibited a decline in performance. DeepSeek-R1 and Qwen2.5-max also showed positive interaction effects in specific themes (such as Theme 3). The PCA results further indicated that Gemini-2.0-flash-Thinking had the best comprehensive ability in both types of tasks and revealed performance differences among the LLMs.
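The abstract reports an ICC for average measures but does not state which ICC form was used. As a sketch under that caveat, the code below computes the two-way, consistency, average-measures ICC, (MSR - MSE) / MSR, from a two-way ANOVA decomposition. The choice of this form, the function name, and the `ratings` table are assumptions made for illustration only.

```python
def icc_average_consistency(ratings):
    """Two-way consistency ICC for average measures: (MSR - MSE) / MSR.

    ratings: one row per rated response, one column per rater.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for responses (rows), raters (columns), and the total.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / ms_rows

# Hypothetical total scores given by three raters to five responses.
ratings = [
    [20, 22, 21],
    [18, 17, 19],
    [23, 24, 22],
    [16, 18, 17],
    [21, 20, 22],
]

icc = icc_average_consistency(ratings)
print(f"ICC (consistency, average measures) = {icc:.3f}")
```

A consistency ICC ignores constant offsets between raters, so three raters who always differ by a fixed number of points would still yield a value of 1; an absolute-agreement form would penalize such offsets.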

CONCLUSION

This study reveals that LLMs differ in their capabilities in dental implantology and recommends context-specific model selection for different clinical scenarios. Gemini-2.0-flash-Thinking demonstrated the best performance, notably for high-level clinical support.

TRIAL REGISTRATION

The study protocol and the use of clinical case data have been approved by the Medical Ethics Committee of Zhejiang Provincial People's Hospital (Approval No. QT2025050) on March 4th, 2025. Clinical trial number is not applicable.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f761/12302792/152ea7ea1cee/12903_2025_6619_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f761/12302792/342366841e6a/12903_2025_6619_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f761/12302792/c01744d9bb2f/12903_2025_6619_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f761/12302792/e1132974a562/12903_2025_6619_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f761/12302792/37c7b87177a4/12903_2025_6619_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f761/12302792/e7822f9fc07f/12903_2025_6619_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f761/12302792/c10d095a8d72/12903_2025_6619_Fig7_HTML.jpg
