Comparative performance of humans versus GPT-4.0 and GPT-3.5 in the self-assessment program of American Academy of Ophthalmology.

Affiliations

Department of Ophthalmology, University Magna Graecia of Catanzaro, Catanzaro, Italy.

Department of Clinical Sciences and Translational Medicine, University of Rome Tor Vergata, Rome, Italy.

Publication Information

Sci Rep. 2023 Oct 29;13(1):18562. doi: 10.1038/s41598-023-45837-2.

Abstract

To compare the performance of humans, GPT-4.0 and GPT-3.5 in answering multiple-choice questions from the American Academy of Ophthalmology (AAO) Basic and Clinical Science Course (BCSC) self-assessment program, available at https://www.aao.org/education/self-assessments. In June 2023, text-based multiple-choice questions were submitted to GPT-4.0 and GPT-3.5. The AAO provides the percentage of test-takers who selected the correct answer, which was used as the human benchmark. All questions were classified by 10 subspecialties and 3 practice areas (diagnostics/clinics, medical treatment, surgery). Out of 1023 questions, GPT-4.0 achieved the best score (82.4%), followed by humans (75.7%) and GPT-3.5 (65.9%), with significant differences in accuracy rates (all P < 0.0001). Both GPT-4.0 and GPT-3.5 performed worst on surgery-related questions (74.6% and 57.0%, respectively). On difficult questions (answered incorrectly by > 50% of humans), both GPT models compared favorably with humans, without reaching statistical significance. Answers provided by GPT-4.0 were significantly shorter than those produced by GPT-3.5 (160 ± 56 vs. 206 ± 77 words, P < 0.0001); however, incorrect responses were longer (P < 0.02). GPT-4.0 represented a substantial improvement over GPT-3.5 and outperformed humans on the AAO BCSC self-assessment test. However, ChatGPT is still limited by inconsistent performance across practice areas, especially surgery.
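The reported figures are enough to illustrate the structure of the accuracy comparison. Below is a minimal sketch (not the authors' analysis code) that reconstructs pairwise 2×2 chi-square tests from the reported percentages; it treats human accuracy as if it came from a single cohort answering all 1023 questions, a simplification of the per-question response rates the AAO actually reports.

```python
# Sketch of the pairwise accuracy comparison described in the abstract.
# Counts are reconstructed from the reported percentages over n = 1023
# questions; this is an illustrative assumption, not the original data.
from scipy.stats import chi2_contingency

N = 1023
correct = {
    "GPT-4.0": round(0.824 * N),  # 82.4% reported accuracy
    "humans":  round(0.757 * N),  # 75.7% reported accuracy
    "GPT-3.5": round(0.659 * N),  # 65.9% reported accuracy
}

# Pairwise 2x2 tables: rows = responder, columns = correct / incorrect.
pairs = [("GPT-4.0", "humans"), ("GPT-4.0", "GPT-3.5"), ("humans", "GPT-3.5")]
for a, b in pairs:
    table = [
        [correct[a], N - correct[a]],
        [correct[b], N - correct[b]],
    ]
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"{a} vs {b}: chi2 = {chi2:.1f}, p = {p:.2g}")
```

The p-values from this simplified reconstruction will not exactly match the published ones, since the authors compared per-question human accuracy rather than pooled counts, but it shows the form of test such a comparison involves.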


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ccc2/10613606/b9dce85735f5/41598_2023_45837_Fig1_HTML.jpg
