Evaluating Artificial Intelligence Chatbots in Oral and Maxillofacial Surgery Board Exams: Performance and Potential.

Authors

Mahmoud Reema, Shuster Amir, Kleinman Shlomi, Arbel Shimrit, Ianculovici Clariel, Peleg Oren

Affiliations

Resident, Department of Oral and Maxillofacial Surgery, Tel-Aviv Sourasky Medical Center, Tel Aviv, Israel.

Senior Surgeon, Department of Oral and Maxillofacial Surgery, Tel-Aviv Sourasky Medical Center, Tel Aviv, Israel; Senior Surgeon, Department of Oral and Maxillofacial Surgery, Goldschleger School of Dental Medicine, Tel-Aviv University, Tel-Aviv, Israel.

Publication information

J Oral Maxillofac Surg. 2025 Mar;83(3):382-389. doi: 10.1016/j.joms.2024.11.007. Epub 2024 Nov 19.

DOI: 10.1016/j.joms.2024.11.007
PMID: 39642920

Abstract

BACKGROUND

While artificial intelligence has significantly impacted medicine, the application of large language models (LLMs) in oral and maxillofacial surgery (OMS) remains underexplored.

PURPOSE

This study aimed to measure and compare the accuracy of 4 leading LLMs on OMS board examination questions and to identify specific areas for improvement.

STUDY DESIGN, SETTING, AND SAMPLE

An in-silico cross-sectional study was conducted to evaluate 4 artificial intelligence chatbots on 714 OMS board examination questions.

PREDICTOR VARIABLE

The predictor variable was the LLM used: LLM 1 (Generative Pretrained Transformer 4o [GPT-4o], OpenAI, San Francisco, CA), LLM 2 (Generative Pretrained Transformer 3.5 [GPT-3.5], OpenAI, San Francisco, CA), LLM 3 (Gemini, Google, Mountain View, CA), and LLM 4 (Copilot, Microsoft, Redmond, WA).

MAIN OUTCOME VARIABLES

The primary outcome variable was accuracy, defined as the percentage of correct answers provided by each LLM. Secondary outcomes included the LLMs' ability to correct errors on subsequent attempts and their performance across 11 specific OMS subject domains: medicine and anesthesia, dentoalveolar and implant surgery, maxillofacial trauma, maxillofacial infections, maxillofacial pathology, salivary glands, oncology, maxillofacial reconstruction, temporomandibular joint anatomy and pathology, craniofacial and clefts, and orthognathic surgery.
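The outcome definitions above can be illustrated with a minimal Python sketch. The function names and the sample answers are hypothetical and are not data from the study. The abstract also does not spell out how error correction was scored; the sketch assumes it is the share of initially incorrect answers that become correct on a retry.

```python
from collections import defaultdict


def accuracy_pct(graded):
    """Percentage of correct answers in a list of booleans."""
    if not graded:
        raise ValueError("no graded answers")
    return 100.0 * sum(graded) / len(graded)


def accuracy_by_domain(records):
    """Map each domain to its accuracy, given (domain, is_correct) pairs."""
    buckets = defaultdict(list)
    for domain, is_correct in records:
        buckets[domain].append(is_correct)
    return {domain: accuracy_pct(marks) for domain, marks in buckets.items()}


def correction_rate_pct(first_try, retry):
    """Assumed definition: share of initially wrong answers that are right on retry."""
    wrong = [i for i, ok in enumerate(first_try) if not ok]
    if not wrong:
        raise ValueError("no initial errors to correct")
    fixed = sum(1 for i in wrong if retry[i])
    return 100.0 * fixed / len(wrong)


# Hypothetical graded answers, not data from the study.
records = [
    ("maxillofacial trauma", True),
    ("maxillofacial trauma", False),
    ("orthognathic surgery", True),
    ("orthognathic surgery", True),
]
overall = accuracy_pct([ok for _, ok in records])
per_domain = accuracy_by_domain(records)

# Hypothetical first attempts and retries for four questions.
first_try = [True, False, False, True]
retry = [True, True, False, True]
corrected = correction_rate_pct(first_try, retry)
```

With these made-up inputs, three of four answers are correct overall, and one of the two initial errors is fixed on the retry.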

COVARIATES

No additional covariates were considered.

ANALYSES

Statistical analysis included one-way ANOVA and post hoc Tukey honest significant difference (HSD) to compare performance across chatbots. χ² tests were used to assess response consistency and error correction, with statistical significance set at P < .05.
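As a rough illustration of the tests named above, the following standard-library Python sketch computes a one-way ANOVA F statistic and a Pearson χ² statistic for a contingency table. All numbers are hypothetical: the abstract does not report the underlying counts or say which units the ANOVA compared. The Tukey HSD step is omitted because it needs the studentized range distribution, which is not in the standard library. The P value helper applies only to a 2×2 table (1 degree of freedom).

```python
import math


def one_way_anova_f(groups):
    """F statistic: between-group mean square over within-group mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def chi_square_stat(table):
    """Pearson chi-square statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def chi_square_p_df1(stat):
    """Upper-tail P value for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical scores for three groups (not study data).
f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])

# Hypothetical 2x2 table: rows = two chatbots, columns = corrected / not corrected.
chi2 = chi_square_stat([[10, 20], [20, 10]])
p_value = chi_square_p_df1(chi2)
significant = p_value < 0.05  # threshold stated in the abstract
```

In practice a statistics package would normally handle these tests, including the post hoc comparisons; the hand-written versions here only show what each statistic measures.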

RESULTS

LLM 1 achieved the highest accuracy with an average score of 83.69%, statistically significantly outperforming LLM 3 (66.85%, P = .002), LLM 2 (64.83%, P = .001), and LLM 4 (62.18%, P < .001). Across the 11 OMS subject domains, LLM 1 consistently had the highest accuracy rates. LLM 1 also corrected 98.2% of errors, while LLM 2 corrected 93.44%, both statistically significantly higher than LLM 4 (29.26%) and LLM 3 (70.71%) (P < .001).

CONCLUSION AND RELEVANCE

LLM 1 (GPT-4o) significantly outperformed other models in both accuracy and error correction, indicating its strong potential as a tool for enhancing OMS education. However, the variability in performance across different domains highlights the need for ongoing refinement and continued evaluation to integrate these LLMs more effectively into the OMS field.
