Diagnostic Performance of ChatGPT-4o and DeepSeek-3 Differential Diagnosis of Complex Oral Lesions: A Multimodal Imaging and Case Difficulty Analysis.

Authors

Hassanein Fatma E A, El Barbary Ahmed, Hussein Radwa R, Ahmed Yousra, El-Guindy Jylan, Sarhan Susan, Abou-Bakr Asmaa

Affiliations

Oral Medicine, Periodontology, and Oral Diagnosis, Faculty of Dentistry, King Salman International University, El Tur, Egypt.

Oral Medicine and Periodontology, Faculty of Dentistry, Cairo University, Giza, Egypt.

Publication information

Oral Dis. 2025 Jul 1. doi: 10.1111/odi.70007.

DOI: 10.1111/odi.70007
PMID: 40589366

Abstract

BACKGROUND

AI models such as ChatGPT-4o and DeepSeek-3 show diagnostic promise, but their reliability in complex, image-based oral lesions remains unclear. This study aimed to evaluate and compare the diagnostic accuracy of ChatGPT-4o and DeepSeek-3, despite their differing modalities, against oral medicine (OM) experts across varied lesion types and case difficulty levels.

METHODS

Eighty standardized clinical vignettes derived from real-world oral disease cases, including clinical images/radiographs, were evaluated. Differential diagnoses were generated by ChatGPT-4o, DeepSeek-3, and four board-certified OM specialists, with accuracy assessed at Top-1, Top-3, and Top-5 levels.
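The Top-1, Top-3, and Top-5 measures can be illustrated with a minimal sketch. It assumes each rater returns a ranked list of candidate diagnoses and that a case counts as correct when the reference diagnosis appears among the first k entries. The function name `top_k_accuracy`, the case-insensitive string matching, and the four example cases are hypothetical illustrations, not the study's data or scoring procedure.

```python
from typing import Sequence


def top_k_accuracy(
    ranked: Sequence[Sequence[str]], truth: Sequence[str], k: int
) -> float:
    """Fraction of cases whose reference diagnosis is among the first k candidates."""
    if len(ranked) != len(truth):
        raise ValueError("ranked and truth must have the same number of cases")
    if not truth:
        raise ValueError("at least one case is required")
    hits = 0
    for candidates, reference in zip(ranked, truth):
        # Assumption: a hit is an exact, case-insensitive match on the diagnosis name.
        top_k = {c.strip().lower() for c in candidates[:k]}
        if reference.strip().lower() in top_k:
            hits += 1
    return hits / len(truth)


# Hypothetical ranked differentials for four illustrative cases (not study data).
ranked = [
    ["oral lichen planus", "leukoplakia", "candidiasis"],
    ["pyogenic granuloma", "peripheral giant cell granuloma", "fibroma"],
    ["leukoplakia", "squamous cell carcinoma", "erythroplakia"],
    ["mucocele", "fibroma", "lipoma"],
]
truth = ["oral lichen planus", "fibroma", "squamous cell carcinoma", "ranula"]

top1 = top_k_accuracy(ranked, truth, k=1)  # only the first case matches at rank 1
top3 = top_k_accuracy(ranked, truth, k=3)  # the first three cases match within rank 3
print(top1, top3)
```

Exact string matching is the simplest possible criterion; a real evaluation would more plausibly rely on expert adjudication of whether a listed diagnosis is equivalent to the reference.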

RESULTS

OM specialists consistently achieved the highest diagnostic accuracy. However, DeepSeek-3 significantly outperformed ChatGPT-4o at the Top-3 level (p = 0.0153) and showed greater robustness in high-difficulty and inflammatory cases despite its text-only modality. Multimodal imaging enhanced diagnostic accuracy. Regression analysis indicated lesion type and imaging modality as positive predictors, while diagnostic difficulty negatively impacted Top-1 performance.

CONCLUSIONS

Remarkably, the text-only DeepSeek-3 model exceeded the diagnostic performance of the multimodal ChatGPT-4o model for complex oral lesions, highlighting its structured reasoning capabilities and reduced hallucination rate. These findings underscore the potential of non-vision LLMs in diagnostic support while emphasizing the critical need for expert oversight in complex scenarios.

Similar articles

1
Diagnostic Performance of ChatGPT-4o and DeepSeek-3 Differential Diagnosis of Complex Oral Lesions: A Multimodal Imaging and Case Difficulty Analysis.
Oral Dis. 2025 Jul 1. doi: 10.1111/odi.70007.
2
Evaluating ChatGPT and DeepSeek in postdural puncture headache management: a comparative study with international consensus guidelines.
BMC Neurol. 2025 Jul 1;25(1):264. doi: 10.1186/s12883-025-04280-8.
3
Performance of ChatGPT-4o and Four Open-Source Large Language Models in Generating Diagnoses Based on China's Rare Disease Catalog: Comparative Study.
J Med Internet Res. 2025 Jun 18;27:e69929. doi: 10.2196/69929.
4
Diagnostic Performance of ChatGPT-4o in Detecting Hip Fractures on Pelvic X-rays.
Cureus. 2025 Jun 24;17(6):e86654. doi: 10.7759/cureus.86654. eCollection 2025 Jun.
5
Clinical feasibility of AI Doctors: Evaluating the replacement potential of large language models in outpatient settings for central nervous system tumors.
Int J Med Inform. 2025 Jun 12;203:106013. doi: 10.1016/j.ijmedinf.2025.106013.
6
Performance of ChatGPT-4o in the diagnostic workup of fever among returning travellers requiring hospitalization: a validation study.
J Travel Med. 2025 Apr 25;32(4). doi: 10.1093/jtm/taaf005.
7
A structured evaluation of LLM-generated step-by-step instructions in cadaveric brachial plexus dissection.
BMC Med Educ. 2025 Jul 1;25(1):903. doi: 10.1186/s12909-025-07493-0.
8
Evaluating the Accuracy and Performance of ChatGPT-4o in Solving Japanese National Dental Technician Examination.
Int Dent J. 2025 Jun 9;75(4):100847. doi: 10.1016/j.identj.2025.100847.
9
Diagnostic Performance of Multimodal Large Language Models in the Analysis of Oral Pathology.
Oral Dis. 2025 Jun 22. doi: 10.1111/odi.70009.
10
Risk of Bias Assessment of Diagnostic Accuracy Studies Using QUADAS 2 by Large Language Models.
Diagnostics (Basel). 2025 Jun 6;15(12):1451. doi: 10.3390/diagnostics15121451.