Performance of large language artificial intelligence models on solving restorative dentistry and endodontics student assessments.

Author information

Department of Operative, Preventive and Pediatric Dentistry, Charité - Universitätsmedizin Berlin, Aßmannshauser Str. 4-6, Berlin, 14197, Germany.

Publication information

Clin Oral Investig. 2024 Oct 7;28(11):575. doi: 10.1007/s00784-024-05968-w.

DOI:10.1007/s00784-024-05968-w

PMID:39373739

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11458639/

Abstract

OBJECTIVES

The advent of artificial intelligence (AI) and large language model (LLM)-based AI applications (LLMAs) has tremendous implications for our society. This study analyzed the performance of LLMAs on solving restorative dentistry and endodontics (RDE) student assessment questions.

MATERIALS AND METHODS

A total of 151 questions from an RDE question pool were prepared for prompting LLMAs from OpenAI (ChatGPT-3.5, -4.0 and -4.0o) and Google (Gemini 1.0). The multiple-choice questions were sorted into four subcategories and entered into the LLMAs, and the answers were recorded for analysis. P-value and chi-square statistical analyses were performed using Python 3.9.16.
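The abstract does not describe how the chi-square comparison was set up, so the following is only a minimal sketch of one plausible form: a Pearson chi-square test (without continuity correction) on a 2x2 table of correct versus incorrect answers for two models. The function name `chi_square_2x2` and the example counts are hypothetical and are not taken from the study.

```python
import math


def chi_square_2x2(correct_a, total_a, correct_b, total_b):
    """Pearson chi-square test (no continuity correction) on a 2x2 table.

    Rows are the two models, columns are correct / incorrect answers.
    Assumes every row and column total is non-zero.
    """
    observed = [
        [correct_a, total_a - correct_a],
        [correct_b, total_b - correct_b],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(x / 2)), so no external library is needed.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts for illustration only (not study data):
# model A answers 60 of 100 correctly, model B answers 40 of 100.
stat, p = chi_square_2x2(60, 100, 40, 100)
print(f"chi-square = {stat:.2f}, p = {p:.4f}")
```

A library routine such as `scipy.stats.chi2_contingency` would be the more usual choice in practice; the standard-library version is used here only to keep the sketch self-contained.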

RESULTS

The total answer accuracy of ChatGPT-4.0o was the highest, followed by ChatGPT-4.0, Gemini 1.0 and ChatGPT-3.5 (72%, 62%, 44% and 25%, respectively), with significant differences between all LLMAs except between the two GPT-4.0 models. Performance was highest on the subcategories direct restorations and caries, followed by indirect restorations and endodontics.
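To give a feel for the scale of these percentages, the short sketch below converts each reported accuracy into an approximate number of correct answers out of the 151 questions. The abstract reports rounded percentages only, so these counts are estimates that may be off by one; the variable names are illustrative.

```python
TOTAL_QUESTIONS = 151  # size of the question pool stated in the abstract

# Accuracies as reported in the abstract (rounded percentages).
reported_accuracy = {
    "ChatGPT-4.0o": 0.72,
    "ChatGPT-4.0": 0.62,
    "Gemini 1.0": 0.44,
    "ChatGPT-3.5": 0.25,
}

# Approximate counts only: the exact per-model counts are not given
# in the abstract and cannot be recovered from rounded percentages.
approx_correct = {
    model: round(accuracy * TOTAL_QUESTIONS)
    for model, accuracy in reported_accuracy.items()
}

for model, count in approx_correct.items():
    print(f"{model}: about {count} of {TOTAL_QUESTIONS} correct")
```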

CONCLUSIONS

Overall, there are large performance differences among LLMAs. Only the ChatGPT-4 models achieved a success rate high enough for them to be used, with caution, to support the dental academic curriculum.

CLINICAL RELEVANCE

While LLMAs could support clinicians in answering dental field-related questions, this capacity depends strongly on the model employed. The best-performing model, ChatGPT-4.0o, achieved acceptable accuracy rates in some of the subject subcategories analyzed.
