Evaluation of the performance of GPT-3.5 and GPT-4 on the Polish Medical Final Examination.

Author Affiliations

Faculty of Mechatronics, Institute of Metrology and Biomedical Engineering, Warsaw University of Technology, Boboli 8 Street, 02-525, Warsaw, Poland.

Department of Pediatric Cardiology and General Pediatrics, Medical University of Warsaw, Warsaw, Poland.

Publication Info

Sci Rep. 2023 Nov 22;13(1):20512. doi: 10.1038/s41598-023-46995-z.

Abstract

The study aimed to evaluate the performance of two Large Language Models (LLMs), ChatGPT (based on GPT-3.5) and GPT-4, at two temperature parameter values, on the Polish Medical Final Examination (MFE). The models were tested on three editions of the MFE (Spring 2022, Autumn 2022, and Spring 2023) in two language versions: English and Polish. The accuracies of the two models were compared, and the relationship between answer correctness and the answers' metrics was investigated. The study demonstrated that GPT-4 outperformed GPT-3.5 in all three examinations regardless of the language used. GPT-4 achieved a mean accuracy of 79.7% in both the Polish and English versions, passing all MFE versions. GPT-3.5 had mean accuracies of 54.8% for Polish and 60.3% for English, passing none of the Polish versions at temperature 0 and two of three at temperature 1, while passing all English versions regardless of the temperature value. GPT-4's scores were mostly lower than the average scores of medical students. There was a statistically significant correlation between the correctness of the answers and the index of difficulty for both models. The overall accuracy of both models was still suboptimal and worse than the average for medical students. This emphasizes the need for further improvements in LLMs before they can be reliably deployed in medical settings. Nevertheless, these findings suggest a growing potential for the use of LLMs in medical education.

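The abstract reports a statistically significant correlation between answer correctness and each question's index of difficulty. As an illustrative sketch only (the data and function below are hypothetical, not taken from the study), this kind of relationship can be quantified as a point-biserial correlation, i.e. an ordinary Pearson correlation between a binary correctness vector and the difficulty index:

```python
import math

def pearson_r(xs, ys):
    # Pearson correlation coefficient; when xs is binary (0/1),
    # this equals the point-biserial correlation.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: 1 = model answered the question correctly,
# 0 = incorrect, paired with each question's index of difficulty
# (the fraction of examinees who answered it correctly).
correct = [1, 1, 0, 1, 0, 1, 1, 0]
difficulty_index = [0.9, 0.8, 0.3, 0.7, 0.2, 0.85, 0.6, 0.4]

r = pearson_r(correct, difficulty_index)
print(round(r, 3))
```

A positive coefficient here would mean the model tends to succeed on the same questions that examinees find easy, which is the direction of the relationship the abstract describes.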
Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8adf/10665355/1bb5d7398182/41598_2023_46995_Fig1_HTML.jpg
