Yamaguchi Shino, Morishita Masaki, Fukuda Hikaru, Muraoka Kosuke, Nakamura Taiji, Yoshioka Izumi, Soh Inho, Ono Kentaro, Awano Shuji
School of Oral Health Sciences, Kyushu Dental University, Kitakyushu, Japan.
Division of Clinical Education Development and Research, Department of Oral Function, Kyushu Dental University, Kitakyushu, Japan.
J Dent Sci. 2024 Oct;19(4):2262-2267. doi: 10.1016/j.jds.2024.02.019. Epub 2024 Feb 29.
BACKGROUND/PURPOSE: Large language models (LLMs) such as OpenAI's ChatGPT, Google's Bard, and Microsoft's Bing Chat have shown potential as educational tools in the medical and dental fields. This study evaluated their effectiveness using questions from the Japanese national dental hygienist examination, focusing on textual information only.
MATERIALS AND METHODS: We analyzed 73 questions from the 32nd Japanese national dental hygienist examination, conducted in March 2023, using four LLMs: ChatGPT-3.5, GPT-4, Bard, and Bing Chat. Each question was categorized into one of nine domains. Standardized prompts were used for all LLMs, and Fisher's exact test was applied for statistical analysis.
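The abstract does not report the prompt wording or the interfaces used to query each model. As a minimal sketch of how a standardized prompt might be issued programmatically, the example below sends a fixed template to a chat model through the OpenAI Python client; the template text, model identifier, and helper function are illustrative assumptions, not details from the study.

```python
# Hypothetical sketch only: the prompt template and model name below are
# assumptions, not the standardized prompts actually used in the study.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT_TEMPLATE = (
    "Answer the following question from the Japanese national dental hygienist "
    "examination. Return only the letter(s) of the correct choice(s).\n\n{question}"
)

def ask_model(question_text: str, model: str = "gpt-4") -> str:
    """Send one exam question to a chat model using the same fixed prompt template."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(question=question_text)}],
    )
    return response.choices[0].message.content
```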
RESULTS: GPT-4 achieved the highest accuracy (75.3%), followed by Bing Chat (68.5%), Bard (66.7%), and GPT-3.5 (63.0%). There were no statistically significant differences among the LLMs. Performance varied across question categories, with all models excelling in the 'Disease mechanism and promotion of recovery process' category (100% accuracy). GPT-4 generally outperformed the other models, especially on multi-answer questions.
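To illustrate the statistical comparison described above, the sketch below applies Fisher's exact test (scipy.stats.fisher_exact) to a 2x2 table of correct versus incorrect answers for two models. The counts (55/73 and 46/73) are back-calculated from the reported accuracies of GPT-4 (75.3%) and GPT-3.5 (63.0%) and are used here as assumptions for illustration only.

```python
# Pairwise comparison of two models' accuracy on the same 73 questions using
# Fisher's exact test; counts are assumed, back-calculated from the reported
# percentages, not taken directly from the paper.
from scipy.stats import fisher_exact

n_items = 73
correct = {"GPT-4": 55, "GPT-3.5": 46}  # assumed correct-answer counts

table = [
    [correct["GPT-4"], n_items - correct["GPT-4"]],      # GPT-4: correct, incorrect
    [correct["GPT-3.5"], n_items - correct["GPT-3.5"]],  # GPT-3.5: correct, incorrect
]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")
```

Under these assumed counts the p-value is above 0.05, consistent with the reported absence of a statistically significant difference between models.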
CONCLUSION: GPT-4 demonstrated the highest overall accuracy among the LLMs tested, indicating its superior potential as an educational support tool in dental hygiene studies. The study highlights the variation in performance among LLMs across question categories. While GPT-4 is currently the most effective, the capabilities of LLMs in educational settings are subject to continual change and improvement.