An exploratory assessment of GPT-4o and GPT-4 performance on the Japanese National Dental Examination.

Author information

Morishita Masaki, Fukuda Hikaru, Yamaguchi Shino, Muraoka Kosuke, Nakamura Taiji, Hayashi Masanari, Yoshioka Izumi, Ono Kentaro, Awano Shuji

Affiliations

Division of Clinical Education Development and Research, Department of Oral Function, Kyushu Dental University, Kitakyushu, Japan.

Health Information Management Office, Kyushu Dental University Hospital, Kitakyushu, Japan.

Publication information

Saudi Dent J. 2024 Dec;36(12):1577-1581. doi: 10.1016/j.sdentj.2024.11.006. Epub 2024 Nov 26.

DOI:10.1016/j.sdentj.2024.11.006

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11976070/

Abstract

BACKGROUND AND OBJECTIVES

Multiple large language models (LLMs) have been released since 2022, including OpenAI's GPT-3.5 and GPT-4. The latest model, GPT-4o, introduced on May 13, 2024, is a significant improvement over GPT-4. Previous studies have shown the potential of LLMs as educational tools in medical and dental exams. This study evaluates the accuracy of GPT-4 and GPT-4o responses to questions from the Japanese National Dental Examination (JNDE) to assess their potential as tools for dental education.

MATERIALS AND METHODS

We obtained the dataset of the 117th JNDE, administered in January 2024, consisting of 360 questions. After excluding questions that contained images or were deemed inappropriate, 202 questions remained. Responses were generated with GPT-4 and GPT-4o, using standardized prompts to keep the input consistent. Data were analyzed in Qlik Sense® and GraphPad Prism using Fisher's exact test.

RESULTS

GPT-4o showed a significantly higher correct response rate (73.8%) than GPT-4 (63.3%). In the compulsory section, GPT-4o achieved 88.6% accuracy, significantly higher than GPT-4's 74.3%. In the general section, GPT-4o (66.4%) also scored higher than GPT-4 (58.0%), although the difference was not statistically significant.
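As an illustration of the kind of comparison Fisher's exact test performs, the following is a minimal sketch in plain Python, not the authors' analysis (which used GraphPad Prism). The abstract reports only percentages, so the correct-answer counts here are an assumption: they are reconstructed by rounding the overall rates (73.8% and 63.3%) against the 202 selected questions and may differ slightly from the true counts. The function name `fisher_exact_two_sided` is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of seeing x in the top-left cell
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    x_min = max(0, col1 - row2)
    x_max = min(row1, col1)
    # Sum the probabilities of every table that is no more likely than the
    # observed one (small relative tolerance guards against float rounding).
    return sum(
        p
        for p in (prob(x) for x in range(x_min, x_max + 1))
        if p <= p_observed * (1 + 1e-7)
    )


# ASSUMPTION: counts reconstructed from the reported overall percentages;
# the abstract does not give the raw numbers of correct answers.
TOTAL_QUESTIONS = 202
gpt4o_correct = round(0.738 * TOTAL_QUESTIONS)
gpt4_correct = round(0.633 * TOTAL_QUESTIONS)

p_value = fisher_exact_two_sided(
    gpt4o_correct, TOTAL_QUESTIONS - gpt4o_correct,
    gpt4_correct, TOTAL_QUESTIONS - gpt4_correct,
)
print(f"GPT-4o {gpt4o_correct}/{TOTAL_QUESTIONS} vs "
      f"GPT-4 {gpt4_correct}/{TOTAL_QUESTIONS}: p = {p_value:.4f}")
```

The sketch treats the two models' answers as independent samples, which is what a 2x2 Fisher's exact test assumes. Because both models answered the same questions, a paired test could also be considered, but the abstract does not report the per-question data that would require.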

CONCLUSION

GPT-4o significantly outperformed GPT-4 in accuracy on JNDE questions, suggesting improved potential as a tool for dental education. Further studies are needed to evaluate GPT-4o's capabilities with visual materials and on more diverse question sets to fully ascertain its utility in educational settings.
