Hassanein Fatma E A, El Barbary Ahmed, Hussein Radwa R, Ahmed Yousra, El-Guindy Jylan, Sarhan Susan, Abou-Bakr Asmaa
Oral Medicine, Periodontology, and Oral Diagnosis, Faculty of Dentistry, King Salman International University, El Tur, Egypt.
Oral Medicine and Periodontology, Faculty of Dentistry, Cairo University, Giza, Egypt.
Oral Dis. 2025 Jul 1. doi: 10.1111/odi.70007.
AI models such as ChatGPT-4o and DeepSeek-3 show diagnostic promise, but their reliability for complex, image-based oral lesions remains unclear. This study aimed to evaluate and compare the diagnostic accuracy of ChatGPT-4o and DeepSeek-3, despite their differing input modalities, against oral medicine (OM) experts across varied lesion types and case difficulty levels.
Eighty standardized clinical vignettes derived from real-world oral disease cases, including clinical images/radiographs, were evaluated. Differential diagnoses were generated by ChatGPT-4o, DeepSeek-3, and four board-certified OM specialists, and accuracy was assessed at the Top-1, Top-3, and Top-5 levels.
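For readers unfamiliar with Top-k scoring, the sketch below shows one way such accuracy could be computed from ranked differential lists. The function name, the toy diagnoses, and the case-insensitive exact-match rule are illustrative assumptions, not the study's actual adjudication procedure.

```python
# Minimal sketch of Top-k accuracy scoring for ranked differential diagnoses.
# The data and the exact-match rule are illustrative assumptions only.

def top_k_accuracy(ranked_diagnoses, reference, k):
    """Fraction of cases whose reference diagnosis appears in the top k predictions."""
    hits = 0
    for ranked, gold in zip(ranked_diagnoses, reference):
        top_k = [d.strip().lower() for d in ranked[:k]]
        if gold.strip().lower() in top_k:
            hits += 1
    return hits / len(reference)

# Hypothetical example: two vignettes, each with a ranked differential list.
model_output = [
    ["oral lichen planus", "leukoplakia", "candidiasis"],
    ["periapical cyst", "ameloblastoma", "odontogenic keratocyst"],
]
gold_standard = ["leukoplakia", "odontogenic keratocyst"]

print(top_k_accuracy(model_output, gold_standard, k=1))  # 0.0
print(top_k_accuracy(model_output, gold_standard, k=3))  # 1.0
```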
OM specialists consistently achieved the highest diagnostic accuracy. However, DeepSeek-3 significantly outperformed ChatGPT-4o at the Top-3 level (p = 0.0153) and showed greater robustness in high-difficulty and inflammatory cases despite its text-only modality. The availability of multimodal imaging enhanced diagnostic accuracy. Regression analysis identified lesion type and imaging modality as positive predictors of Top-1 performance, whereas diagnostic difficulty had a negative effect.
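As a rough illustration of the kind of regression reported above, the sketch below fits per-case Top-1 correctness on lesion type, imaging availability, and difficulty. The variable coding and toy data are assumptions for demonstration, not the study's dataset or model specification.

```python
# Illustrative logistic regression of Top-1 correctness on case characteristics.
# All data and variable codings below are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: lesion_type (0 = inflammatory, 1 = neoplastic),
#          has_image   (0 = text only,    1 = image/radiograph provided),
#          difficulty  (1 = easy ... 3 = hard)
X = np.array([
    [0, 1, 1], [1, 1, 3], [0, 0, 2], [1, 0, 3],
    [0, 1, 1], [1, 1, 2], [0, 0, 3], [1, 0, 2],
])
y = np.array([1, 0, 1, 1, 1, 1, 0, 0])  # 1 = Top-1 diagnosis correct

clf = LogisticRegression().fit(X, y)
print(dict(zip(["lesion_type", "has_image", "difficulty"], clf.coef_[0])))
```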
Remarkably, the text-only DeepSeek-3 model exceeded the diagnostic performance of the multimodal ChatGPT-4o model for complex oral lesions, reflecting its structured reasoning capabilities and lower hallucination rate. These findings underscore the potential of non-vision LLMs in diagnostic support while emphasizing the critical need for expert oversight in complex scenarios.