

Performance of Large Language Models on the Korean Dental Licensing Examination: A Comparative Study.

Authors

Kim Woojun, Kim Bong Chul, Yeom Han-Gyeol

Affiliations

The Robotics Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA.

Department of Oral and Maxillofacial Surgery, Daejeon Dental Hospital, Wonkwang University College of Dentistry, Daejeon, Korea.

Publication

Int Dent J. 2025 Feb;75(1):176-184. doi: 10.1016/j.identj.2024.09.002. Epub 2024 Oct 6.

Abstract

PURPOSE

This study investigated the potential application of large language models (LLMs) in dental education and practice, with a focus on ChatGPT and Claude3-Opus. Using the Korean Dental Licensing Examination (KDLE) as a benchmark, we aimed to assess the capabilities of these models in the dental field.

METHODS

This study evaluated three LLMs: GPT-3.5, GPT-4 (version: March 2024), and Claude3-Opus (version: March 2024). We provided the KDLE questions from 2019 to 2023 as inputs to the LLMs and treated the LLMs' outputs as the corresponding answers. Total scores for individual subjects were obtained and compared. We also compared the performance of the LLMs with that of the examinees who took the exams.

RESULTS

Claude3-Opus performed best among the LLMs considered, except in 2019, when GPT-4 performed best. Both Claude3-Opus and GPT-4 surpassed the cut-off scores in all years considered, indicating that they passed the KDLE, whereas GPT-3.5 did not. However, all LLMs considered performed worse than humans, represented here by dental students in Korea. On average, the best-performing LLM each year achieved 85.4% of human performance.

CONCLUSION

Using the KDLE as a benchmark, our study demonstrates that although LLMs have not yet reached human-level performance in overall scores, both Claude3-Opus and GPT-4 exceed the cut-off scores and perform exceptionally well in specific subjects.

CLINICAL RELEVANCE

Our findings will aid in evaluating the feasibility of integrating LLMs into dentistry to improve the quality and availability of dental services by providing patients with information that meets the basic competency standards of a dentist.


Figure: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3931/11806296/faa66196aba4/gr1.jpg
