

Accuracy of latest large language models in answering multiple choice questions in dentistry: A comparative study.

Authors

Nguyen Huy Cong, Dang Hai Phong, Nguyen Thuy Linh, Hoang Viet, Nguyen Viet Anh

Affiliations

Faculty of Dentistry, PHENIKAA University, Hanoi, Vietnam.

Faculty of Dentistry, Van Lang University, Ho Chi Minh City, Vietnam.

Publication

PLoS One. 2025 Jan 29;20(1):e0317423. doi: 10.1371/journal.pone.0317423. eCollection 2025.

DOI: 10.1371/journal.pone.0317423
PMID: 39879192
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11778630/
Abstract

OBJECTIVES

This study aims to evaluate the performance of the latest large language models (LLMs) in answering dental multiple choice questions (MCQs), including both text-based and image-based questions.

MATERIAL AND METHODS

A total of 1490 MCQs from two board review books for the United States National Board Dental Examination were selected. This study evaluated six of the latest LLMs as of August 2024, including ChatGPT 4.0 omni (OpenAI), Gemini Advanced 1.5 Pro (Google), Copilot Pro with GPT-4 Turbo (Microsoft), Claude 3.5 Sonnet (Anthropic), Mistral Large 2 (Mistral AI), and Llama 3.1 405b (Meta). χ2 tests were performed to determine whether there were significant differences in the percentages of correct answers among LLMs for both the total sample and each discipline (p < 0.05).

RESULTS

Significant differences were observed in the percentage of accurate answers among the six LLMs across text-based questions, image-based questions, and the total sample (p<0.001). For the total sample, Copilot (85.5%), Claude (84.0%), and ChatGPT (83.8%) demonstrated the highest accuracy, followed by Mistral (78.3%) and Gemini (77.1%), with Llama (72.4%) exhibiting the lowest.
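The overall comparison can be sketched as a chi-square test of homogeneity on a models-by-outcome contingency table. The counts below are reconstructed from the reported overall accuracies assuming all 1490 questions per model, which is an approximation: models without image support answered fewer questions, so these are illustrative figures, not the paper's actual tallies.

```python
# Approximate reconstruction of the chi-square comparison across the six LLMs.
# Counts are derived from reported overall accuracies, NOT from the paper's data.
from scipy.stats import chi2_contingency

models = ["Copilot", "Claude", "ChatGPT", "Mistral", "Gemini", "Llama"]
accuracy = [0.855, 0.840, 0.838, 0.783, 0.771, 0.724]
n = 1490  # assumed common denominator (an approximation, see note above)

# Rows: one model each; columns: [correct, incorrect].
table = [[round(a * n), n - round(a * n)] for a in accuracy]

chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.3g}")
```

With six models and two outcomes the test has 5 degrees of freedom, and the roughly 13-point spread in accuracy at this sample size yields p well below 0.001, consistent with the reported result.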

CONCLUSIONS

Newer versions of LLMs demonstrate superior performance in answering dental MCQs compared to earlier versions. Copilot, Claude, and ChatGPT achieved high accuracy on text-based questions and low accuracy on image-based questions. LLMs capable of handling image-based questions demonstrated superior performance compared to LLMs limited to text-based questions.

CLINICAL RELEVANCE

Dental clinicians and students should prioritize the most up-to-date LLMs when supporting their learning, clinical practice, and research.


Figures:
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5e7a/11778630/adbc13be03f2/pone.0317423.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5e7a/11778630/ffd187521bc4/pone.0317423.g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5e7a/11778630/83f01337e0ed/pone.0317423.g003.jpg

Similar articles

1. Accuracy of latest large language models in answering multiple choice questions in dentistry: A comparative study. PLoS One. 2025 Jan 29;20(1):e0317423. doi: 10.1371/journal.pone.0317423. eCollection 2025.
2. Assessment of the Large Language Models in Creating Dental Board-Style Questions: A Prospective Cross-Sectional Study. Eur J Dent Educ. 2025 Jul 16. doi: 10.1111/eje.70015.
3. Evaluating Bard Gemini Pro and GPT-4 Vision Against Student Performance in Medical Visual Question Answering: Comparative Case Study. JMIR Form Res. 2024 Dec 17;8:e57592. doi: 10.2196/57592.
4. Performance of Large Language Models in Numerical Versus Semantic Medical Knowledge: Cross-Sectional Benchmarking Study on Evidence-Based Questions and Answers. J Med Internet Res. 2025 Jul 14;27:e64452. doi: 10.2196/64452.
5. Benchmarking Vision Capabilities of Large Language Models in Surgical Examination Questions. J Surg Educ. 2025 Apr;82(4):103442. doi: 10.1016/j.jsurg.2025.103442. Epub 2025 Feb 9.
6. Comparative performance of ChatGPT, Gemini, and final-year emergency medicine clerkship students in answering multiple-choice questions: implications for the use of AI in medical education. Int J Emerg Med. 2025 Aug 7;18(1):146. doi: 10.1186/s12245-025-00949-6.
7. Large language models in medical education: a comparative cross-platform evaluation in answering histological questions. Med Educ Online. 2025 Dec;30(1):2534065. doi: 10.1080/10872981.2025.2534065. Epub 2025 Jul 12.
8. Accuracy of large language models in generating differential diagnosis from clinical presentation and imaging findings in pediatric cases. Pediatr Radiol. 2025 Jul 12. doi: 10.1007/s00247-025-06317-z.
9. Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis. J Med Internet Res. 2024 Jul 25;26:e60807. doi: 10.2196/60807.
10. Evaluation of Large Language Model Performance in Answering Clinical Questions on Periodontal Furcation Defect Management. Dent J (Basel). 2025 Jun 18;13(6):271. doi: 10.3390/dj13060271.

Cited by

1. Exploring the role of DeepSeek-R1, ChatGPT-4, and Google Gemini in medical education: How valid and reliable are they? Pak J Med Sci. 2025 Jul;41(7):1887-1892. doi: 10.12669/pjms.41.7.12183.
2. Accuracy of Large Language Models When Answering Clinical Research Questions: Systematic Review and Network Meta-Analysis. J Med Internet Res. 2025 Apr 30;27:e64486. doi: 10.2196/64486.
3. Artificial intelligence performance in answering multiple-choice oral pathology questions: a comparative analysis. BMC Oral Health. 2025 Apr 15;25(1):573. doi: 10.1186/s12903-025-05926-2.
4. Correcting Multiple Spaces in Adult Patients With Precise Tooth Movement Control Using In-House Clear Aligners. Clin Case Rep. 2025 Apr 4;13(4):e70393. doi: 10.1002/ccr3.70393. eCollection 2025 Apr.

References

1. Evaluating the Effectiveness of advanced large language models in medical Knowledge: A Comparative study using Japanese national medical examination. Int J Med Inform. 2025 Jan;193:105673. doi: 10.1016/j.ijmedinf.2024.105673. Epub 2024 Oct 28.
2. Legal aspects of generative artificial intelligence and large language models in examinations and theses. GMS J Med Educ. 2024 Sep 16;41(4):Doc47. doi: 10.3205/zma001702. eCollection 2024.
3. Currently Available Large Language Models Do Not Provide Musculoskeletal Treatment Recommendations That Are Concordant With Evidence-Based Clinical Practice Guidelines. Arthroscopy. 2025 Feb;41(2):263-275.e6. doi: 10.1016/j.arthro.2024.07.040. Epub 2024 Aug 22.
4. Is use of ChatGPT cheating? Students of health professions perceptions. Med Teach. 2025 May;47(5):894-898. doi: 10.1080/0142159X.2024.2385667. Epub 2024 Aug 4.
5. The conversational AI "ChatGPT" outperforms medical students on a physiology university examination. Adv Physiol Educ. 2024 Dec 1;48(4):677-684. doi: 10.1152/advan.00181.2023. Epub 2024 Jul 11.
6. Performance of large language models in oral and maxillofacial surgery examinations. Int J Oral Maxillofac Surg. 2024 Oct;53(10):881-886. doi: 10.1016/j.ijom.2024.06.003. Epub 2024 Jun 25.
7. Leveraging Large Language Models in the delivery of post-operative dental care: a comparison between an embedded GPT model and ChatGPT. BDJ Open. 2024 Jun 12;10(1):48. doi: 10.1038/s41405-024-00226-3.
8. ChatGPT's performance in dentistry and allergy-immunology assessments: a comparative study. Swiss Dent J. 2023 Oct 4;134(2):1-17. doi: 10.61872/sdj-2024-06-01.
9. Exploring the Performance of ChatGPT Versions 3.5, 4, and 4 With Vision in the Chilean Medical Licensing Examination: Observational Study. JMIR Med Educ. 2024 Apr 29;10:e55048. doi: 10.2196/55048.
10. Evaluating ChatGPT-4's Diagnostic Accuracy: Impact of Visual Data Integration. JMIR Med Inform. 2024 Apr 9;12:e55627. doi: 10.2196/55627.