Evaluating the efficacy of leading large language models in the Japanese national dental hygienist examination: A comparative analysis of ChatGPT, Bard, and Bing Chat.

Author information

Yamaguchi Shino, Morishita Masaki, Fukuda Hikaru, Muraoka Kosuke, Nakamura Taiji, Yoshioka Izumi, Soh Inho, Ono Kentaro, Awano Shuji

Affiliations

School of Oral Health Sciences, Kyushu Dental University, Kitakyushu, Japan.

Division of Clinical Education Development and Research, Department of Oral Function, Kyushu Dental University, Kitakyushu, Japan.

Publication information

J Dent Sci. 2024 Oct;19(4):2262-2267. doi: 10.1016/j.jds.2024.02.019. Epub 2024 Feb 29.

Abstract

BACKGROUND/PURPOSE

Large language models (LLMs) such as OpenAI's ChatGPT, Google's Bard, and Microsoft's Bing Chat have shown potential as educational tools in the medical and dental fields. This study evaluated their effectiveness using questions from the Japanese national dental hygienist examination, focusing on textual information only.

MATERIALS AND METHODS

We tested four LLMs (ChatGPT-3.5, GPT-4, Bard, and Bing Chat) on 73 questions from the 32nd Japanese national dental hygienist examination, conducted in March 2023. Each question was categorized into one of nine domains. Standardized prompts were used for all LLMs, and Fisher's exact test was applied for statistical analysis.
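As a rough illustration of the statistical step, the sketch below implements a two-sided Fisher's exact test for a single 2x2 table using only the Python standard library. It is not the authors' analysis code: the abstract does not say whether the test was run pairwise or across all four models, so the pairwise comparison here is an assumption. The function name `fisher_exact_two_sided` is hypothetical, and the counts (55 and 46 correct out of 73) are inferred from the accuracies reported in the Results rather than stated in the abstract.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of every table no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p_value)


# ASSUMPTION: counts inferred from reported accuracies
# (55/73 is about 75.3%, 46/73 is about 63.0%); not given in the abstract.
total = 73
gpt4_correct, gpt35_correct = 55, 46
p_value = fisher_exact_two_sided(
    gpt4_correct, total - gpt4_correct,
    gpt35_correct, total - gpt35_correct,
)
print(f"GPT-4 vs GPT-3.5 (pairwise, illustrative): p = {p_value:.3f}")
```

Fisher's exact test suits a sample of this size because it does not rely on the large-sample approximation behind the chi-square test.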

RESULTS

GPT-4 achieved the highest accuracy (75.3%), followed by Bing Chat (68.5%), Bard (66.7%), and ChatGPT-3.5 (63.0%). The differences between the LLMs were not statistically significant. Performance varied across question categories, with all models reaching 100% accuracy in the 'Disease mechanism and promotion of recovery process' category. GPT-4 generally outperformed the other models, especially in multi-answer questions.
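The reported percentages can be checked against the stated total of 73 questions with a little arithmetic. The sketch below searches for the whole-number count of correct answers that rounds to each reported accuracy. The helper `matching_counts` is a hypothetical name, and the model labels are only dictionary keys. Under this check, three of the four percentages match a count out of 73, while 66.7% for Bard does not; it would match 48 out of 72. The abstract does not explain the difference, so a smaller denominator for Bard is only a guess.

```python
TOTAL_QUESTIONS = 73  # stated in the abstract

# Accuracies as reported in the abstract (percent).
reported = {"GPT-4": 75.3, "Bing Chat": 68.5, "Bard": 66.7, "ChatGPT-3.5": 63.0}


def matching_counts(percent: float, total: int) -> list[int]:
    """Return every count k for which 100 * k / total rounds to `percent` (1 decimal)."""
    return [k for k in range(total + 1) if round(100 * k / total, 1) == percent]


for model, percent in reported.items():
    counts = matching_counts(percent, TOTAL_QUESTIONS)
    print(f"{model}: {percent}% of {TOTAL_QUESTIONS} -> candidate counts {counts}")

# ASSUMPTION: trying a denominator of 72 for Bard is a guess, not a reported figure.
print("Bard with 72 questions:", matching_counts(66.7, 72))
```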

CONCLUSION

GPT-4 demonstrated the highest overall accuracy among the LLMs tested, indicating its superior potential as an educational support tool in dental hygiene studies. The study highlights the varied performance of different LLMs across various question categories. While GPT-4 is currently the most effective, the capabilities of LLMs in educational settings are subject to continual change and improvement.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e943/11437298/0a1f15e63468/gr1.jpg
