Yamaguchi Shino, Morishita Masaki, Fukuda Hikaru, Muraoka Kosuke, Nakamura Taiji, Yoshioka Izumi, Soh Inho, Ono Kentaro, Awano Shuji
School of Oral Health Sciences, Kyushu Dental University, Kitakyushu, Japan.
Division of Clinical Education Development and Research, Department of Oral Function, Kyushu Dental University, Kitakyushu, Japan.
J Dent Sci. 2024 Oct;19(4):2262-2267. doi: 10.1016/j.jds.2024.02.019. Epub 2024 Feb 29.
BACKGROUND/PURPOSE: Large language models (LLMs) such as OpenAI's ChatGPT, Google's Bard, and Microsoft's Bing Chat have shown potential as educational tools in the medical and dental fields. This study evaluated their effectiveness using questions from the Japanese national dental hygienist examination, focusing on textual information only.
MATERIALS AND METHODS: We analyzed 73 questions from the 32nd Japanese national dental hygienist examination, conducted in March 2023, using four LLMs: ChatGPT-3.5, GPT-4, Bard, and Bing Chat. Each question was categorized into one of nine domains. Standardized prompts were used for all LLMs, and Fisher's exact test was applied for statistical analysis.
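The abstract does not report the prompt wording or the interfaces used to query each model. As a minimal sketch of how a standardized prompt might be issued programmatically, the example below sends a fixed template to a chat model through the OpenAI Python client; the template text, model identifier, and helper function are illustrative assumptions, not details from the study.

```python
# Hypothetical sketch only: the prompt template and model name below are
# assumptions, not the standardized prompts actually used in the study.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT_TEMPLATE = (
    "Answer the following question from the Japanese national dental hygienist "
    "examination. Return only the letter(s) of the correct choice(s).\n\n{question}"
)

def ask_model(question_text: str, model: str = "gpt-4") -> str:
    """Send one exam question to a chat model using the same fixed prompt template."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(question=question_text)}],
    )
    return response.choices[0].message.content
```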
RESULTS: GPT-4 achieved the highest accuracy (75.3%), followed by Bing Chat (68.5%), Bard (66.7%), and GPT-3.5 (63.0%). There were no statistically significant differences among the LLMs. Performance varied across question categories, with all models excelling in the 'Disease mechanism and promotion of recovery process' category (100% accuracy). GPT-4 generally outperformed the other models, especially on multi-answer questions.
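To illustrate the statistical comparison described above, the sketch below applies Fisher's exact test (scipy.stats.fisher_exact) to a 2x2 table of correct versus incorrect answers for two models. The counts (55/73 and 46/73) are back-calculated from the reported accuracies of GPT-4 (75.3%) and GPT-3.5 (63.0%) and are used here as assumptions for illustration only.

```python
# Pairwise comparison of two models' accuracy on the same 73 questions using
# Fisher's exact test; counts are assumed, back-calculated from the reported
# percentages, not taken directly from the paper.
from scipy.stats import fisher_exact

n_items = 73
correct = {"GPT-4": 55, "GPT-3.5": 46}  # assumed correct-answer counts

table = [
    [correct["GPT-4"], n_items - correct["GPT-4"]],      # GPT-4: correct, incorrect
    [correct["GPT-3.5"], n_items - correct["GPT-3.5"]],  # GPT-3.5: correct, incorrect
]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")
```

Under these assumed counts the p-value is above 0.05, consistent with the reported absence of a statistically significant difference between models.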
CONCLUSION: GPT-4 demonstrated the highest overall accuracy among the LLMs tested, indicating its superior potential as an educational support tool in dental hygiene studies. The study highlights the variation in performance among LLMs across question categories. While GPT-4 is currently the most effective, the capabilities of LLMs in educational settings are subject to continual change and improvement.