Department of Ophthalmology, University Magna Graecia of Catanzaro, Catanzaro, Italy.
Department of Clinical Sciences and Translational Medicine, University of Rome Tor Vergata, Rome, Italy.
Sci Rep. 2023 Oct 29;13(1):18562. doi: 10.1038/s41598-023-45837-2.
To compare the performance of humans, GPT-4.0 and GPT-3.5 in answering multiple-choice questions from the American Academy of Ophthalmology (AAO) Basic and Clinical Science Course (BCSC) self-assessment program, available at https://www.aao.org/education/self-assessments. In June 2023, text-based multiple-choice questions were submitted to GPT-4.0 and GPT-3.5. The AAO provides the percentage of humans who selected the correct answer, which was analyzed for comparison. All questions were classified by 10 subspecialties and 3 practice areas (diagnostics/clinics, medical treatment, surgery). Out of 1023 questions, GPT-4.0 achieved the best score (82.4%), followed by humans (75.7%) and GPT-3.5 (65.9%), with significant differences in accuracy rates (P < 0.0001 for all comparisons). Both GPT-4.0 and GPT-3.5 showed their worst results on surgery-related questions (74.6% and 57.0%, respectively). For difficult questions (answered incorrectly by > 50% of humans), both GPT models compared favorably with humans, although the difference did not reach statistical significance. The word count of answers provided by GPT-4.0 was significantly lower than that of GPT-3.5 (160 ± 56 vs 206 ± 77 words, respectively, P < 0.0001); however, incorrect responses were longer than correct ones (P < 0.02). GPT-4.0 represented a substantial improvement over GPT-3.5, achieving better performance than humans in an AAO BCSC self-assessment test. However, ChatGPT is still limited by inconsistency across different practice areas, especially surgery.
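The abstract does not specify how questions were submitted or how accuracy rates were compared statistically. The sketch below is a minimal illustration, not the authors' pipeline: it assumes the OpenAI Python SDK (`client.chat.completions.create`), a hypothetical model identifier passed by the caller, an `OPENAI_API_KEY` environment variable, and SciPy's chi-square test as one plausible way to compare two accuracy proportions of the magnitude reported (82.4% vs 65.9% over 1023 questions).

```python
# Minimal sketch (assumed workflow, not the authors' code): submit one text-based
# multiple-choice question to a GPT model and compare two accuracy rates.
from openai import OpenAI
from scipy.stats import chi2_contingency

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask_mcq(model: str, stem: str, options: dict[str, str]) -> str:
    """Send one multiple-choice question and return the model's reply text."""
    prompt = stem + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items())
    resp = client.chat.completions.create(
        model=model,  # e.g. a GPT-4 or GPT-3.5 model name (hypothetical placeholder)
        messages=[
            {"role": "system", "content": "Answer with the letter of the single best option."},
            {"role": "user", "content": prompt},
        ],
    )
    return resp.choices[0].message.content


# Illustrative proportion comparison: 82.4% vs 65.9% correct out of 1023 questions.
n = 1023
correct_a, correct_b = round(0.824 * n), round(0.659 * n)
table = [[correct_a, n - correct_a], [correct_b, n - correct_b]]
chi2, p, _, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, P = {p:.2e}")  # P well below 0.0001, consistent with the reported result
```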