

Benchmarking of Large Language Models for the Dental Admission Test.

Author Information

Hou Yu, Patel Jay, Dai Liya, Zhang Emily, Liu Yang, Zhan Zaifu, Gangwani Pooja, Zhang Rui

Affiliations

Division of Computational Health Sciences, Department of Surgery, University of Minnesota, Minneapolis, MN 55455, USA.

Center for Learning Health System Sciences, University of Minnesota, Minneapolis, MN 55455, USA.

Publication Information

Health Data Sci. 2025 Apr 1;5:0250. doi: 10.34133/hds.0250. eCollection 2025.

Abstract

Large language models (LLMs) have shown promise in educational applications, but their performance on high-stakes admissions tests, such as the Dental Admission Test (DAT), remains unclear. Understanding the capabilities and limitations of these models is critical for determining their suitability in test preparation. This study evaluated the ability of 16 LLMs, including general-purpose models (e.g., GPT-3.5, GPT-4, GPT-4o, GPT-o1, Google's Bard, mistral-large, and Claude), domain-specific fine-tuned models (e.g., DentalGPT, MedGPT, and BioGPT), and open-source models (e.g., Llama2-7B, Llama2-13B, Llama2-70B, Llama3-8B, and Llama3-70B), to answer questions from a sample DAT. Quantitative analysis was performed to assess model accuracy in different sections, and qualitative thematic analysis by subject matter experts examined specific challenges encountered by the models. GPT-4o and GPT-o1 outperformed others in text-based questions assessing knowledge and comprehension, with GPT-o1 achieving perfect scores in the natural sciences (NS) and reading comprehension (RC) sections. Open-source models such as Llama3-70B also performed competitively in RC tasks. However, all models, including GPT-4o, struggled substantially with perceptual ability (PA) items, highlighting a persistent limitation in handling image-based tasks requiring visual-spatial reasoning. Fine-tuned medical models (e.g., DentalGPT, MedGPT, and BioGPT) demonstrated moderate success in text-based tasks but underperformed in areas requiring critical thinking and reasoning. Thematic analysis identified key challenges, including difficulties with stepwise problem-solving, transferring knowledge, comprehending intricate questions, and hallucinations, particularly on advanced items. 
While LLMs show potential for reinforcing factual knowledge and supporting learners, their limitations in handling higher-order cognitive tasks and image-based reasoning underscore the need for judicious integration with instructor-led guidance and targeted practice. This study provides valuable insights into the capabilities and limitations of current LLMs in preparing prospective dental students and highlights pathways for future innovations to improve performance across all cognitive skills assessed by the DAT.
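The quantitative analysis described above boils down to aggregating per-section accuracy for each model. A minimal sketch of that aggregation, assuming hypothetical `(model, section, correct)` records (the function name `section_accuracy` and the sample data are illustrative, not the study's actual evaluation harness):

```python
from collections import defaultdict

# Hypothetical graded responses: (model, section, answered correctly).
# Illustrative only — not the study's actual data.
results = [
    ("GPT-4o", "NS", True),
    ("GPT-4o", "NS", True),
    ("GPT-4o", "PA", False),
    ("Llama3-70B", "RC", True),
    ("Llama3-70B", "RC", False),
]

def section_accuracy(records):
    """Compute accuracy per (model, section) pair from graded records."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for model, section, correct in records:
        totals[(model, section)] += 1
        hits[(model, section)] += int(correct)
    return {key: hits[key] / totals[key] for key in totals}

acc = section_accuracy(results)
print(acc[("GPT-4o", "NS")])  # 1.0 — both sample NS items correct
```

The dictionary keyed on `(model, section)` mirrors how the study reports results: one accuracy figure per model for each DAT section (NS, RC, PA, etc.).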


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2327/11961047/8f2d4e8f0219/hds.0250.fig.001.jpg
