Nguyen Huy Cong, Dang Hai Phong, Nguyen Thuy Linh, Hoang Viet, Nguyen Viet Anh
Faculty of Dentistry, PHENIKAA University, Hanoi, Vietnam.
Faculty of Dentistry, Van Lang University, Ho Chi Minh City, Vietnam.
PLoS One. 2025 Jan 29;20(1):e0317423. doi: 10.1371/journal.pone.0317423. eCollection 2025.
This study aims to evaluate the performance of the latest large language models (LLMs) in answering dental multiple-choice questions (MCQs), including both text-based and image-based questions.
A total of 1490 MCQs from two board review books for the United States National Board Dental Examination were selected. Six of the latest LLMs available as of August 2024 were evaluated: ChatGPT 4.0 omni (OpenAI), Gemini Advanced 1.5 Pro (Google), Copilot Pro with GPT-4 Turbo (Microsoft), Claude 3.5 Sonnet (Anthropic), Mistral Large 2 (Mistral AI), and Llama 3.1 405b (Meta). Chi-square (χ2) tests were performed to determine whether the percentages of correct answers differed significantly among the LLMs, both for the total sample and within each discipline (p < 0.05).
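As a minimal sketch of the statistical comparison described above, the snippet below runs a chi-square test of independence on a 6 x 2 table of correct versus incorrect answer counts per model using scipy.stats.chi2_contingency. The counts are hypothetical placeholders back-calculated from the reported overall accuracies under the assumption that every model answered all 1490 questions, which may not match the study's actual tabulation.

from scipy.stats import chi2_contingency

# Hypothetical correct/incorrect counts per model (not the study's raw data),
# chosen to be roughly consistent with the reported overall accuracies.
models = ["Copilot", "Claude", "ChatGPT", "Mistral", "Gemini", "Llama"]
counts = [
    [1274, 216],  # Copilot, ~85.5% of 1490
    [1252, 238],  # Claude, ~84.0%
    [1249, 241],  # ChatGPT, ~83.8%
    [1167, 323],  # Mistral, ~78.3%
    [1149, 341],  # Gemini, ~77.1%
    [1079, 411],  # Llama, ~72.4%
]

# Chi-square test of independence: do correct-answer rates differ across models?
chi2, p_value, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4g}")

for model, (correct, incorrect) in zip(models, counts):
    total = correct + incorrect
    print(f"{model}: {correct}/{total} correct ({100 * correct / total:.1f}%)")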
Significant differences were observed in the percentage of accurate answers among the six LLMs across text-based questions, image-based questions, and the total sample (p < 0.001). For the total sample, Copilot (85.5%), Claude (84.0%), and ChatGPT (83.8%) demonstrated the highest accuracy, followed by Mistral (78.3%) and Gemini (77.1%), with Llama (72.4%) exhibiting the lowest accuracy.
Newer versions of LLMs demonstrate superior performance in answering dental MCQs compared to earlier versions. Copilot, Claude, and ChatGPT achieved high accuracy on text-based questions and low accuracy on image-based questions. LLMs capable of handling image-based questions demonstrated superior performance compared to LLMs limited to text-based questions.
Dental clinicians and students should prioritize the most up-to-date LLMs to support their learning, clinical practice, and research.