Hu Ziyang, Xu Zhe, Shi Ping, Zhang Dandan, Yue Qu, Zhang Jiexia, Lei Xin, Lin Zitong
Int J Comput Dent. 2024 Dec 9;27(4):401-411. doi: 10.3290/j.ijcd.b5870240.
The objective of the present study was to investigate the clinical understanding and reasoning abilities of three large language models (LLMs), namely ChatGPT, GPT-4, and New Bing, by evaluating their performance on the National Dental Licensing Examination (NDLE) in China.
Questions from the 2020 to 2022 NDLE were selected according to subject weightings. Standardized prompts were used to constrain the output of the LLMs and obtain more precise answers. The performance of the models was compared within each subject category and across all subjects using McNemar's test.
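McNemar's test compares two models that answered the same questions, using only the discordant pairs (questions that one model answered correctly and the other did not). The following is a minimal sketch of the exact (binomial) form of the test, not the study's own analysis code; the function names and the toy answer lists are hypothetical, and the abstract does not state whether the exact or chi-squared variant was used.

```python
from math import comb


def discordant_counts(a_correct, b_correct):
    """Count questions where exactly one of two models was correct.

    a_correct and b_correct are equal-length sequences of booleans,
    one entry per question, in the same question order.
    """
    only_a = sum(1 for x, y in zip(a_correct, b_correct) if x and not y)
    only_b = sum(1 for x, y in zip(a_correct, b_correct) if y and not x)
    return only_a, only_b


def mcnemar_exact(only_a, only_b):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    Under the null hypothesis each discordant pair is equally likely to
    favour either model, so min(only_a, only_b) follows Binomial(n, 0.5).
    """
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical toy data (NOT from the study): per-question correctness.
model_a = [True, True, False, True, False, True, True, False]
model_b = [True, False, False, False, False, True, False, False]

only_a, only_b = discordant_counts(model_a, model_b)
p_value = mcnemar_exact(only_a, only_b)
```

Applying this to the study itself would require the per-question results for each model, which the abstract does not report.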
The percentage scores obtained by ChatGPT, GPT-4, and New Bing were 42.6% (138/324), 63.0% (204/324), and 72.5% (235/324), respectively. The performance of New Bing differed significantly from that of both ChatGPT and GPT-4. GPT-4 and New Bing outperformed ChatGPT in all subjects, and New Bing surpassed GPT-4 in most subjects.
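The reported percentages follow directly from the counts of correct answers out of 324 questions. As a small check, the sketch below recomputes them; the variable and function names are illustrative only.

```python
TOTAL_QUESTIONS = 324

# Correct-answer counts as reported in the abstract.
correct_answers = {"ChatGPT": 138, "GPT-4": 204, "New Bing": 235}


def percent_score(correct, total):
    """Return the percentage of correct answers, rounded to one decimal."""
    return round(100 * correct / total, 1)


percentages = {
    model: percent_score(count, TOTAL_QUESTIONS)
    for model, count in correct_answers.items()
}

for model, pct in percentages.items():
    print(f"{model}: {pct}%")
```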
GPT-4 and New Bing exhibited promising capabilities in the NDLE. However, their performance in specific subjects such as prosthodontics and oral and maxillofacial surgery requires improvement. This performance gap can be attributed to limited dental training data and the inherent complexity of these subjects.