Comparative Analysis of ChatGPT-3.5 and GPT-4 in Open-Ended Clinical Reasoning Across Dental Specialties.

Author information

Babaee Hemmati Yasamin, Rasouli Morteza, Falahchai Mehran

Affiliations

Department of Orthodontics, Dental Sciences Research Center, School of Dentistry, Guilan University of Medical Sciences, Rasht, Iran.

School of Dentistry, Guilan University of Medical Sciences, Rasht, Iran.

Publication information

Eur J Dent Educ. 2025 Jun 13. doi: 10.1111/eje.13144.

DOI:10.1111/eje.13144

PMID:40515430

Abstract

PURPOSE

The integration of large language models (LLMs) such as ChatGPT into health care has garnered increasing interest. While previous studies have assessed these models using structured multiple-choice questions, limited research has evaluated their performance on open-ended, scenario-based clinical tasks, particularly in dentistry. This study aimed to evaluate and compare the clinical reasoning capabilities of ChatGPT-3.5 and GPT-4 in formulating treatment plans across seven dental specialties using realistic, open-ended clinical scenarios.

METHODS

A cross-sectional analytical study, reported in accordance with the STROBE guidelines, was conducted using 70 dental cases spanning endodontics, oral and maxillofacial surgery, oral medicine, orthodontics, paediatric dentistry, periodontology, and radiology. Each case was submitted to both ChatGPT-3.5 and GPT-4 (paid version, November 2024). Responses were evaluated by specialty-specific expert panels using a three-level rubric (poor, average, good). Statistical analyses included chi-square tests and Fisher-Freeman-Halton exact tests (α = 0.05).

RESULTS

GPT-4 significantly outperformed ChatGPT-3.5 in overall response quality (67.1% vs. 44.3% of responses rated 'good'; p = 0.016). Although differences did not reach significance in most specialties, GPT-4 performed significantly better in oral and maxillofacial surgery. Its advantage was more pronounced in complex cases, consistent with the model's enhanced contextual reasoning.
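The overall comparison can be illustrated with a minimal sketch of a Pearson chi-square test. Several assumptions apply: the counts (47 and 31 of 70) are back-calculated from the reported percentages, the three-level rubric is collapsed to 'good' versus 'not good' because the poor/average split is not given in the abstract, and the function name `chi_square_2x2` is hypothetical. Because the authors presumably tested the full three-level table, the p-value from this collapsed 2 × 2 table is not expected to match the reported p = 0.016.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns the statistic and its p-value for 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    expected = [
        row1 * col1 / n,
        row1 * col2 / n,
        row2 * col1 / n,
        row2 * col2 / n,
    ]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Assumption: counts inferred from the reported percentages of 70 cases per model.
n_cases = 70
gpt4_good = round(0.671 * n_cases)   # 47 responses rated 'good'
gpt35_good = round(0.443 * n_cases)  # 31 responses rated 'good'

chi2, p_value = chi_square_2x2(
    gpt4_good, n_cases - gpt4_good,
    gpt35_good, n_cases - gpt35_good,
)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")
```

Because each case was submitted to both models, the observations are paired; this sketch treats the two sets of responses as independent samples, in line with the chi-square approach named in the Methods.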

CONCLUSION

GPT-4 demonstrated superior accuracy and consistency compared with ChatGPT-3.5, particularly in clinically complex and integrative tasks. These findings support the potential of advanced LLMs as adjunct tools in dental education and decision-making, though specialty-specific applications and expert oversight remain essential.

Abstract (Chinese)