Department of Ophthalmology, University Magna Graecia of Catanzaro, Catanzaro, Italy.
Department of Clinical Sciences and Translational Medicine, University of Rome Tor Vergata, Rome, Italy.
Sci Rep. 2023 Oct 29;13(1):18562. doi: 10.1038/s41598-023-45837-2.
To compare the performance of humans, GPT-4.0 and GPT-3.5 in answering multiple-choice questions from the American Academy of Ophthalmology (AAO) Basic and Clinical Science Course (BCSC) self-assessment program, available at https://www.aao.org/education/self-assessments. In June 2023, text-based multiple-choice questions were submitted to GPT-4.0 and GPT-3.5. The AAO provides the percentage of humans who selected the correct answer, which was analyzed for comparison. All questions were classified by 10 subspecialties and 3 practice areas (diagnostics/clinics, medical treatment, surgery). Out of 1023 questions, GPT-4.0 achieved the best score (82.4%), followed by humans (75.7%) and GPT-3.5 (65.9%), with significant differences in accuracy rates (P < 0.0001 for all comparisons). Both GPT-4.0 and GPT-3.5 showed their worst results on surgery-related questions (74.6% and 57.0%, respectively). For difficult questions (answered incorrectly by > 50% of humans), both GPT models compared favorably with humans, although the difference did not reach statistical significance. The word count of answers provided by GPT-4.0 was significantly lower than that of GPT-3.5 (160 ± 56 vs 206 ± 77 words, respectively, P < 0.0001); however, incorrect responses were longer than correct ones (P < 0.02). GPT-4.0 represented a substantial improvement over GPT-3.5, achieving better performance than humans in an AAO BCSC self-assessment test. However, ChatGPT is still limited by inconsistency across different practice areas, especially surgery.
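The abstract does not specify how questions were submitted or how accuracy rates were compared statistically. The sketch below is a minimal illustration, not the authors' pipeline: it assumes the OpenAI Python SDK (`client.chat.completions.create`), a hypothetical model identifier passed by the caller, an `OPENAI_API_KEY` environment variable, and SciPy's chi-square test as one plausible way to compare two accuracy proportions of the magnitude reported (82.4% vs 65.9% over 1023 questions).

```python
# Minimal sketch (assumed workflow, not the authors' code): submit one text-based
# multiple-choice question to a GPT model and compare two accuracy rates.
from openai import OpenAI
from scipy.stats import chi2_contingency

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask_mcq(model: str, stem: str, options: dict[str, str]) -> str:
    """Send one multiple-choice question and return the model's reply text."""
    prompt = stem + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items())
    resp = client.chat.completions.create(
        model=model,  # e.g. a GPT-4 or GPT-3.5 model name (hypothetical placeholder)
        messages=[
            {"role": "system", "content": "Answer with the letter of the single best option."},
            {"role": "user", "content": prompt},
        ],
    )
    return resp.choices[0].message.content


# Illustrative proportion comparison: 82.4% vs 65.9% correct out of 1023 questions.
n = 1023
correct_a, correct_b = round(0.824 * n), round(0.659 * n)
table = [[correct_a, n - correct_a], [correct_b, n - correct_b]]
chi2, p, _, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, P = {p:.2e}")  # P well below 0.0001, consistent with the reported result
```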