
Diagnostic Performance of ChatGPT-4o and DeepSeek-3 in the Differential Diagnosis of Complex Oral Lesions: A Multimodal Imaging and Case Difficulty Analysis.

Author Information

Hassanein Fatma E A, El Barbary Ahmed, Hussein Radwa R, Ahmed Yousra, El-Guindy Jylan, Sarhan Susan, Abou-Bakr Asmaa

Affiliations

Oral Medicine, Periodontology, and Oral Diagnosis, Faculty of Dentistry, King Salman International University, El Tur, Egypt.

Oral Medicine and Periodontology, Faculty of Dentistry, Cairo University, Giza, Egypt.

Publication Information

Oral Dis. 2025 Jul 1. doi: 10.1111/odi.70007.

Abstract

BACKGROUND

AI models such as ChatGPT-4o and DeepSeek-3 show diagnostic promise, but their reliability for complex, image-based oral lesions remains unclear. This study aimed to evaluate and compare the diagnostic accuracy of ChatGPT-4o and DeepSeek-3, despite their differing modalities, against oral medicine (OM) experts across varied lesion types and case-difficulty levels.

METHODS

Eighty standardized clinical vignettes derived from real-world oral disease cases, including clinical images/radiographs, were evaluated. Differential diagnoses were generated by ChatGPT-4o, DeepSeek-3, and four board-certified OM specialists, with accuracy assessed at Top-1, Top-3, and Top-5 levels.
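The Top-1, Top-3, and Top-5 metrics score a case as correct when the reference diagnosis appears among the first k entries of the ranked differential list. A minimal sketch of this scoring scheme (illustrative diagnoses only, not the study's actual cases or data):

```python
def top_k_accuracy(ranked_differentials, reference_diagnoses, k):
    """Fraction of cases where the reference diagnosis appears
    among the first k entries of the ranked differential list."""
    hits = sum(
        truth in preds[:k]
        for preds, truth in zip(ranked_differentials, reference_diagnoses)
    )
    return hits / len(reference_diagnoses)

# Hypothetical example: three cases, each with a ranked differential
preds = [
    ["lichen planus", "leukoplakia", "candidiasis"],
    ["pemphigus vulgaris", "mucous membrane pemphigoid", "lichen planus"],
    ["squamous cell carcinoma", "leukoplakia", "traumatic ulcer"],
]
truths = ["leukoplakia", "lichen planus", "squamous cell carcinoma"]

print(top_k_accuracy(preds, truths, 1))  # only case 3 hits at rank 1
print(top_k_accuracy(preds, truths, 3))  # all three hit within the top 3
```

By construction Top-3 accuracy can only meet or exceed Top-1, which is why the study reports the metrics at several cutoffs.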

RESULTS

OM specialists consistently achieved the highest diagnostic accuracy. However, DeepSeek-3 significantly outperformed ChatGPT-4o at the Top-3 level (p = 0.0153) and, despite its text-only modality, showed greater robustness in high-difficulty and inflammatory cases. Multimodal imaging enhanced diagnostic accuracy. Regression analysis identified lesion type and imaging modality as positive predictors, while diagnostic difficulty negatively impacted Top-1 performance.

CONCLUSIONS

Remarkably, the text-only DeepSeek-3 model exceeded the diagnostic performance of the multimodal ChatGPT-4o model for complex oral lesions, highlighting its structured reasoning capabilities and reduced hallucination rate. These findings underscore the potential of non-vision LLMs in diagnostic support, emphasizing the critical need for expert oversight in complex scenarios.

