Department of Operative, Preventive and Pediatric Dentistry, Charité - Universitätsmedizin Berlin, Aßmannshauser Str. 4-6, Berlin, 14197, Germany.
Clin Oral Investig. 2024 Oct 7;28(11):575. doi: 10.1007/s00784-024-05968-w.
The advent of artificial intelligence (AI) and large language model (LLM)-based AI applications (LLMAs) has tremendous implications for our society. This study analyzed the performance of LLMAs in solving restorative dentistry and endodontics (RDE) student assessment questions.
A total of 151 questions from an RDE question pool were prepared for prompting LLMAs from OpenAI (ChatGPT-3.5, -4.0 and -4.0o) and Google (Gemini 1.0). The multiple-choice questions were sorted into four subcategories and entered into the LLMAs, and the answers were recorded for analysis. Chi-square tests and the associated p-values were computed using Python 3.9.16.
ChatGPT-4.0o achieved the highest total answer accuracy, followed by ChatGPT-4.0, Gemini 1.0 and ChatGPT-3.5 (72%, 62%, 44% and 25%, respectively), with significant differences between all LLMAs except between the two ChatGPT-4.0 models. Performance was highest in the subcategories direct restorations and caries, followed by indirect restorations and endodontics.
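The pairwise comparison behind these figures can be illustrated with a minimal sketch. This is not the authors' analysis code: the abstract reports only percentages, so the correct-answer counts below are back-calculated approximations (percentage times 151, rounded), and the choice of an uncorrected Pearson chi-square test on 2x2 tables with a 0.05 threshold is an assumption. The function names are hypothetical. The sketch uses only the Python standard library; for one degree of freedom, the chi-square p-value equals `erfc(sqrt(chi2 / 2))`.

```python
import math
from itertools import combinations

N_QUESTIONS = 151  # number of questions stated in the abstract

# Total accuracies (percent) as reported in the abstract.
REPORTED_ACCURACY = {
    "ChatGPT-4.0o": 72,
    "ChatGPT-4.0": 62,
    "Gemini 1.0": 44,
    "ChatGPT-3.5": 25,
}


def approx_correct(percent, n=N_QUESTIONS):
    """ASSUMPTION: approximate the correct-answer count from a rounded percentage."""
    return round(percent / 100 * n)


def chi_square_2x2(correct_a, correct_b, n=N_QUESTIONS):
    """Pearson chi-square (no continuity correction) comparing two models,
    each scored on n questions. Returns (statistic, p_value) for 1 degree of freedom."""
    observed = [
        [correct_a, n - correct_a],  # model A: correct, incorrect
        [correct_b, n - correct_b],  # model B: correct, incorrect
    ]
    total = 2 * n
    col_totals = [correct_a + correct_b, total - correct_a - correct_b]
    chi2 = 0.0
    for row in observed:
        for obs, col_total in zip(row, col_totals):
            expected = n * col_total / total  # row total is n for both models
            chi2 += (obs - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def pairwise_p_values():
    """Run the 2x2 test for every pair of models; returns {(a, b): p_value}."""
    counts = {name: approx_correct(pct) for name, pct in REPORTED_ACCURACY.items()}
    return {
        (a, b): chi_square_2x2(counts[a], counts[b])[1]
        for a, b in combinations(counts, 2)
    }


if __name__ == "__main__":
    for (a, b), p in pairwise_p_values().items():
        verdict = "significant" if p < 0.05 else "not significant"
        print(f"{a} vs {b}: p = {p:.4f} ({verdict} at 0.05)")
```

With these approximate counts, the only pair whose p-value stays above 0.05 is ChatGPT-4.0o versus ChatGPT-4.0, which agrees with the pattern reported above. Exact values will differ from the published ones, since the true counts and any correction for multiple comparisons are not given in the abstract.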
Overall, there are large performance differences among LLMAs. Only the ChatGPT-4 models achieved an accuracy high enough to be used, with caution, in support of the dental academic curriculum.
While LLMAs could support clinicians in answering questions related to the dental field, this capacity depends strongly on the model employed. The best-performing model, ChatGPT-4.0o, achieved acceptable accuracy rates in some of the subject subcategories analyzed.