Taşyürek Makbule, Adıgüzel Özkan, Ortaç Hatice
Department of Endodontics, Faculty of Dentistry, Dicle University, 21280 Diyarbakır, Türkiye.
Department of Biostatistics, Faculty of Medicine, Dicle University, 21280 Diyarbakır, Türkiye.
Healthcare (Basel). 2025 Oct 17;13(20):2615. doi: 10.3390/healthcare13202615.
The aim of this study was to compare four recently introduced large language models (LLMs): ChatGPT-5, Grok 4, Gemini 2.5 Flash, and Claude Sonnet-4. Experienced endodontists evaluated the accuracy, completeness, and readability of the models' responses to open-ended questions about iatrogenic events in endodontics. Twenty-five open-ended questions on iatrogenic events in endodontics were prepared. The responses of the four LLMs were evaluated by two specialist endodontists using Likert scales for accuracy and completeness, and the Flesch Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKGL), Gunning Fog Index (GFI), Simplified Measure of Gobbledygook (SMOG), and Coleman-Liau Index (CLI) for readability. The accuracy score of ChatGPT-5's responses to the open-ended questions (4.56 ± 0.65) was significantly higher than those of Gemini 2.5 Flash (3.64 ± 0.95) and Claude Sonnet-4 (3.44 ± 1.19) (p = 0.009 and p = 0.002, respectively). Similarly, the completeness score of ChatGPT-5 (2.88 ± 0.33) was higher than those of Claude Sonnet-4, Gemini 2.5 Flash, and Grok 4 (p < 0.001, p = 0.002, and p = 0.007, respectively). In terms of readability measures, ChatGPT-5 and Gemini 2.5 Flash achieved higher (i.e., easier-to-read) FRES values than Claude Sonnet-4 (p = 0.003 and p < 0.001, respectively). Conversely, FKGL scores were higher for Claude Sonnet-4 and Grok 4 than for ChatGPT-5 (p < 0.001 and p = 0.008, respectively). Correlation analyses revealed a strong positive association between accuracy and completeness (r = 0.77; p < 0.001), a weak negative correlation between completeness and FKGL (r = -0.19; p = 0.047), and a strong negative correlation between FKGL and FRES (r = -0.88; p < 0.001). Additionally, ChatGPT-5 demonstrated lower GFI and CLI scores than the other models, while its SMOG scores were lower than those of Gemini 2.5 Flash and Grok 4 (p = 0.001 and p < 0.001, respectively). Although differences were observed among the LLMs in the accuracy and completeness of their responses, ChatGPT-5 showed the best overall performance. Even when responses score highly for accuracy (excellent) and completeness (comprehensive), it must not be forgotten that incorrect information can lead to serious consequences in healthcare. Therefore, the readability of responses is of critical importance, and when selecting a model, readability should be evaluated together with content quality.
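For context, the five readability indices used in the study are standard text-statistics formulas. The following minimal Python sketch (an illustration, not the authors' actual analysis pipeline) shows how such scores can be computed for a single LLM response with the open-source textstat package; the sample response text is invented for demonstration.

import textstat

# Invented sample response; in the study, each of the four LLMs answered
# 25 open-ended questions on iatrogenic events in endodontics.
response = (
    "A separated instrument can often be bypassed or retrieved under "
    "magnification. If retrieval risks excessive dentin removal, the "
    "fragment may be left in place and the case monitored. Referral to "
    "an endodontist is advisable for complex cases."
)

scores = {
    "FRES": textstat.flesch_reading_ease(response),   # higher = easier to read
    "FKGL": textstat.flesch_kincaid_grade(response),  # US school grade level
    "GFI":  textstat.gunning_fog(response),           # years of education needed
    "SMOG": textstat.smog_index(response),            # grade level; needs >= 3 sentences
    "CLI":  textstat.coleman_liau_index(response),    # grade level from letter counts
}
for name, value in scores.items():
    print(f"{name}: {value:.2f}")

Note that FRES rises as text gets easier, whereas FKGL, GFI, SMOG, and CLI all estimate a required education level and fall as text gets easier; this opposite orientation is consistent with the strong negative FKGL-FRES correlation (r = -0.88) reported above.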