Viet Anh Nguyen, Trang Nguyen Thi
Faculty of Dentistry, Phenikaa University, Hanoi, Vietnam.
Eur J Dent Educ. 2025 Jul 16. doi: 10.1111/eje.70015.
Although some studies have investigated the application of large language models (LLMs) in generating dental-related multiple-choice questions (MCQs), they have primarily focused on ChatGPT and Gemini. This study aims to evaluate and compare the performance of five LLMs in generating dental board-style questions.
This prospective cross-sectional study evaluated five advanced LLMs as of August 2024: ChatGPT-4o (OpenAI), Claude 3.5 Sonnet (Anthropic), Copilot Pro (Microsoft), Gemini 1.5 Pro (Google) and Mistral Large 2 (Mistral AI). The five most recent clinical guidelines published by The Journal of the American Dental Association were used to generate a total of 350 questions (70 questions per LLM). Each question was independently evaluated by two investigators on a 10-point Likert scale against five criteria: clarity, relevance, suitability, distractor quality and rationale.
Inter-rater reliability was substantial (kappa score: 0.7-0.8). Median scores for clarity, relevance and rationale were above 9 across all five LLMs. Suitability and distractor quality had median scores ranging from 8 to 9. Within each LLM, clarity and relevance scored higher than the other criteria (p < 0.05). No significant difference was observed among models for clarity, relevance and suitability (p > 0.05). Claude 3.5 Sonnet outperformed the other models in providing rationales for answers (p < 0.01).
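The inter-rater agreement reported above is Cohen's kappa, which corrects the raw agreement rate between two raters for the agreement expected by chance. A minimal sketch of an unweighted kappa computation, using hypothetical rater scores (the study's actual score data are not published here):

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters' categorical scores."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items the raters scored identically
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's marginal score distribution
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical 10-point Likert scores from two raters on ten questions
a = [9, 8, 9, 10, 7, 9, 8, 8, 9, 10]
b = [9, 8, 9, 9, 7, 9, 8, 7, 9, 10]
print(round(cohen_kappa(a, b), 2))  # → 0.71
```

A kappa in the 0.7-0.8 range, as reported, is conventionally interpreted as substantial agreement. For ordinal Likert data, a weighted kappa (penalising near-misses less than distant disagreements) is also commonly used.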
LLMs demonstrate strong capabilities in generating high-quality, clinically relevant dental board-style questions. Among them, Claude 3.5 Sonnet exhibited the best performance in providing rationales for answers.