University of Cagliari, Cagliari, Italy.
PLoS One. 2024 Oct 23;19(10):e0308157. doi: 10.1371/journal.pone.0308157. eCollection 2024.
This article reports the results of an experiment conducted with ChatGPT to assess how its performance compares to human performance on tests that require specific knowledge and skills, such as university admission tests. We chose one general undergraduate admission test and two tests for admission to biomedical programs: the Scholastic Assessment Test (SAT), the Cambridge BioMedical Admission Test (BMAT), and the Italian Medical School Admission Test (IMSAT). In particular, we looked closely at the difference in performance between ChatGPT-4 and its predecessor, ChatGPT-3.5, to assess the model's evolution. ChatGPT-4 showed a significant improvement over ChatGPT-3.5: compared to real students, its SAT score was on average within the top 10%, while its IMSAT score would have granted admission to the two highest-ranked Italian medical schools. In addition to the performance analysis, we provide a qualitative analysis of incorrect answers and a classification of three different types of logical and computational errors made by ChatGPT-4, which reveal important weaknesses of the model. This provides insight into the skills needed to use these models effectively despite their weaknesses, and suggests possible applications of our analysis in the field of education.