Comparing the performance of ChatGPT 4o, DeepSeek R1, and Gemini 2 Pro in answering fixed prosthodontics questions over time.

Author information

Shirani Mohammadjavad

Affiliation

Assistant Professor, Department of Restorative Dentistry, Maurice H. Kornberg School of Dentistry, Temple University, Philadelphia, PA, United States.

Publication information

J Prosthet Dent. 2025 May 22. doi: 10.1016/j.prosdent.2025.04.038.

STATEMENT OF PROBLEM

The accuracy of DeepSeek and the latest versions of ChatGPT and Gemini in responding to prosthodontics questions needs to be evaluated. Additionally, the extent to which the performance of these chatbots changes through user interactions remains unexplored.

PURPOSE

The purpose of this longitudinal repeated-measures experimental study was to compare the performance of ChatGPT (4o), DeepSeek (R1), and Gemini (2 Pro) in answering multiple-choice (MC) and short-answer (SA) fixed prosthodontics questions over 4 consecutive weeks after exposure to correct responses.

MATERIAL AND METHODS

A total of 40 questions (20 MC and 20 SA) were developed based on the sixth edition of Contemporary Fixed Prosthodontics. Following a standardized protocol, these questions were posed to ChatGPT, DeepSeek, and Gemini on 4 consecutive Saturdays using 10 independent accounts per chatbot. After each session, correct answers were provided to the chatbots, and, before the next session, their memory and history were cleared. Responses were scored as correct (1) or incorrect (0) for MC questions and correct (2), partially correct (1), or incorrect (0) for SA questions. Weighted accuracy was calculated accordingly. The Kendall W coefficient was used to assess agreement among the 10 accounts per chatbot. The effects of chatbot type, time (week), and their interaction on performance were analyzed using generalized estimating equations (GEEs), followed by pairwise comparisons using the Mann-Whitney U test and Wilcoxon signed-rank test with Bonferroni adjustments for multiple comparisons (α=.05).
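The scoring and agreement steps described above can be sketched in Python. This is only an illustration under stated assumptions: the abstract does not give the weighted-accuracy formula, so the sketch assumes accuracy is points earned divided by the maximum points available (1 per MC question, 2 per SA question). The Kendall W function uses the standard tie-corrected formula, treating the accounts as raters and the questions as items; the function names and the sample scores are hypothetical and are not taken from the study.

```python
from collections import Counter


def weighted_accuracy(scores, max_points):
    """Points earned divided by points available (assumed formula)."""
    return sum(scores) / (max_points * len(scores))


def average_ranks(scores):
    """Rank scores in ascending order, giving tied values their mean rank."""
    counts = Counter(scores)
    rank_of = {}
    start = 1
    for value in sorted(counts):
        t = counts[value]
        # Tied values occupy ranks start..start+t-1; use the mean of those.
        rank_of[value] = start + (t - 1) / 2
        start += t
    return [rank_of[s] for s in scores]


def kendall_w(ratings):
    """Tie-corrected Kendall W for m raters (rows) scoring n items (columns)."""
    m = len(ratings)
    n = len(ratings[0])
    ranked = [average_ranks(row) for row in ratings]
    rank_sums = [sum(row[i] for row in ranked) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((x - mean_sum) ** 2 for x in rank_sums)
    # Tie correction: sum of (t^3 - t) over every tie group of every rater.
    ties = sum(t**3 - t for row in ratings for t in Counter(row).values())
    denom = m**2 * (n**3 - n) - m * ties
    if denom == 0:
        raise ValueError("W is undefined when every rater ties all items")
    return 12 * s / denom


# Hypothetical scores for one account (not data from the study).
mc_scores = [1, 0, 1, 1]   # MC: correct (1) or incorrect (0)
sa_scores = [2, 1, 0, 2]   # SA: correct (2), partial (1), incorrect (0)
mc_accuracy = weighted_accuracy(mc_scores, max_points=1)
sa_accuracy = weighted_accuracy(sa_scores, max_points=2)

# Hypothetical agreement check: two accounts scoring the same three questions.
agreement = kendall_w([[0, 1, 1], [0, 1, 1]])
```

A W of 1 indicates that all accounts ranked the questions identically, and a W of 0 indicates no agreement. Because MC scores are binary, ties are frequent, which is why the sketch applies the tie correction rather than the simpler untied formula.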

RESULTS

All chatbots showed significant reproducibility, with Gemini exhibiting the highest repeatability for SA questions, followed by ChatGPT for MC questions. Accuracy ranged between 43% and 71%. ChatGPT and DeepSeek demonstrated significantly better performance in MC questions compared with Gemini (P<.017). However, in the third week, Gemini outperformed DeepSeek in SA questions (P=.007). Over time, Gemini showed continuous improvement in SA questions, whereas DeepSeek exhibited a performance surge in the fourth week. ChatGPT's performance remained stable throughout the study period.
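The P<.017 threshold reported above is consistent with a Bonferroni adjustment of α=.05 across the three possible pairwise chatbot comparisons. That reading is an assumption, since the abstract does not state how many comparisons entered the adjustment. A minimal sketch of the arithmetic, with a hypothetical helper name:

```python
from itertools import combinations


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance level under a Bonferroni adjustment."""
    return alpha / n_comparisons


chatbots = ["ChatGPT", "DeepSeek", "Gemini"]
# Every unordered pair of chatbots: 3 comparisons in total.
pairs = list(combinations(chatbots, 2))
threshold = bonferroni_alpha(0.05, len(pairs))  # 0.05 / 3, about 0.0167

# The reported week-3 SA comparison (P=.007) falls below this threshold.
gemini_vs_deepseek_significant = 0.007 < threshold
```

Under this assumption, a pairwise result counts as significant only when its P value falls below roughly .017 rather than .05.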

CONCLUSIONS

The overall accuracy of the studied chatbots in answering MC and SA prosthodontics questions was not satisfactory. Among them, ChatGPT was the most reliable for MC questions, while ChatGPT and Gemini performed best for SA questions. Gemini (for SA questions) and DeepSeek (for MC and SA questions) demonstrated improvement after exposure to correct responses.
