Liu Mingxin, Okuhara Tsuyoshi, Huang Wenbo, Ogihara Atsushi, Nagao Hikari Sophia, Okada Hiroko, Kiuchi Takahiro
Department of Health Communication, Graduate School of Medicine, The University of Tokyo, Bunkyo, Tokyo, Japan.
Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, Bunkyo, Tokyo, Japan.
Int Dent J. 2025 Feb;75(1):213-222. doi: 10.1016/j.identj.2024.10.014. Epub 2024 Nov 12.
This study is a systematic review and meta-analysis evaluating the performance of various large language models (LLMs) in dental licensing examinations worldwide. The aim was to assess the accuracy of these models across different linguistic and geographical contexts and thereby inform their potential application in dental education and diagnostics.
Following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, we conducted a comprehensive search of PubMed, Web of Science, and Scopus for studies published between 1 January 2022 and 1 May 2024. Two authors independently screened the literature against the inclusion and exclusion criteria, extracted data, and assessed study quality using the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool. We then conducted qualitative and quantitative analyses to evaluate the performance of the LLMs.
Eleven studies met the inclusion criteria, encompassing dental licensing examinations from eight countries. GPT-3.5, GPT-4, and Bard achieved integrated accuracy rates of 54%, 72%, and 56%, respectively. GPT-4 outperformed GPT-3.5 and Bard, passing more than half of the dental licensing examinations. Subgroup analyses and meta-regression showed that GPT-3.5 performed significantly better in English-speaking countries. GPT-4's performance, however, remained consistent across different regions.
LLMs, particularly GPT-4, show potential in dental education and diagnostics, yet their accuracy remains below the threshold required for clinical application. The scarcity of dentistry-specific training data limits LLM accuracy, and the field's reliance on image-based diagnostics poses a further challenge; as a result, accuracy on dental examinations is lower than on medical licensing examinations. Moreover, the LLMs sometimes provided more detailed explanations for incorrect answers than for correct ones. Overall, current LLMs are not yet suitable for use in dental education and clinical diagnosis.
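The integrated accuracy rates reported above (e.g., 72% for GPT-4) come from pooling per-study results across examinations. As a rough illustration of how such pooling works, the sketch below computes an inverse-variance fixed-effect pooled proportion on the logit scale, a standard approach for meta-analysis of proportions. The study counts used here are purely hypothetical placeholders, not data from the reviewed studies, and the paper's actual model (e.g., random-effects) may differ.

```python
import math

# Hypothetical per-study results: (correct answers, total questions).
# These counts are illustrative only, NOT taken from the reviewed studies.
studies = [(120, 200), (90, 180), (140, 250)]

def pooled_accuracy(studies):
    """Inverse-variance fixed-effect pooling of proportions on the
    logit scale, then back-transformation to a proportion."""
    weights, logits = [], []
    for correct, total in studies:
        p = correct / total
        logit = math.log(p / (1 - p))
        var = 1 / correct + 1 / (total - correct)  # variance of the logit
        weights.append(1 / var)                    # inverse-variance weight
        logits.append(logit)
    pooled_logit = sum(w * l for w, l in zip(weights, logits)) / sum(weights)
    return 1 / (1 + math.exp(-pooled_logit))       # back-transform

print(f"Pooled accuracy: {pooled_accuracy(studies):.1%}")
```

Larger, more informative studies receive more weight, so the pooled estimate always falls within the range of the individual study accuracies but is pulled toward the most precise ones.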