Performance of large language models in the National Dental Licensing Examination in China: a comparative analysis of ChatGPT, GPT-4, and New Bing.

Author information

Hu Ziyang, Xu Zhe, Shi Ping, Zhang Dandan, Yue Qu, Zhang Jiexia, Lei Xin, Lin Zitong

Publication information

Int J Comput Dent. 2024 Dec 9;27(4):401-411. doi: 10.3290/j.ijcd.b5870240.

DOI:10.3290/j.ijcd.b5870240

PMID:39651568

Abstract

AIM

The objective of the present study was to investigate the clinical understanding and reasoning abilities of three large language models (LLMs), namely ChatGPT, GPT-4, and New Bing, by evaluating their performance in the National Dental Licensing Examination (NDLE) in China.

MATERIALS AND METHODS

Questions from the 2020 to 2022 NDLE were selected according to subject weightings. Standardized prompts were used to constrain the output of the LLMs and obtain more precise answers. The performance of the models in each subject category and across all subjects was compared using McNemar's test.
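
As an illustration of the paired comparison described here, the following is a minimal sketch of an exact (binomial) McNemar's test on per-question correctness for two models. The abstract does not state which variant of the test the authors used, and it does not report per-question outcomes, so the function name `mcnemar_exact` and the ten-question sample data are hypothetical.

```python
from math import comb


def mcnemar_exact(a_correct, b_correct):
    """Exact two-sided McNemar's test on paired right/wrong outcomes.

    a_correct and b_correct are equal-length sequences of booleans, one
    entry per exam question. Only discordant pairs (questions on which
    exactly one model was correct) contribute to the test.
    Returns (b, c, p_value).
    """
    if len(a_correct) != len(b_correct):
        raise ValueError("both models must answer the same questions")

    # b: model A correct, model B wrong; c: model A wrong, model B correct
    b = sum(1 for x, y in zip(a_correct, b_correct) if x and not y)
    c = sum(1 for x, y in zip(a_correct, b_correct) if not x and y)

    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, so no evidence of a difference

    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical outcomes for ten questions (not data from the study).
model_a = [True, True, False, False, True, False, False, False, True, False]
model_b = [True, True, True, True, True, True, True, False, True, True]

b, c, p_value = mcnemar_exact(model_a, model_b)
```

Because the same questions are put to every model, a paired test of this kind is more appropriate than comparing two independent proportions.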

RESULTS

ChatGPT, GPT-4, and New Bing scored 42.6% (138/324), 63.0% (204/324), and 72.5% (235/324), respectively. The performance of New Bing differed significantly from that of both ChatGPT and GPT-4. GPT-4 and New Bing outperformed ChatGPT in all subjects, and New Bing surpassed GPT-4 in most subjects.
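
The reported percentages can be checked against the raw counts with a short calculation. The counts and the total of 324 questions come from the abstract; the variable and function names are illustrative only.

```python
TOTAL_QUESTIONS = 324

# Number of correctly answered questions per model, as reported in the abstract.
correct_counts = {"ChatGPT": 138, "GPT-4": 204, "New Bing": 235}


def percent_score(correct, total):
    """Return the share of correct answers as a percentage, to one decimal place."""
    return round(100 * correct / total, 1)


scores = {
    model: percent_score(count, TOTAL_QUESTIONS)
    for model, count in correct_counts.items()
}
```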

CONCLUSION

GPT-4 and New Bing exhibited promising capabilities in the NDLE. However, their performance in specific subjects such as prosthodontics and oral and maxillofacial surgery requires improvement. This performance gap can be attributed to limited dental training data and the inherent complexity of these subjects.
