Antaki Fares, Touma Samir, Milad Daniel, El-Khoury Jonathan, Duval Renaud
Department of Ophthalmology, Université de Montréal, Montréal, Quebec, Canada.
Centre Universitaire d'Ophtalmologie (CUO), Hôpital Maisonneuve-Rosemont, CIUSSS de l'Est-de-l'Île-de-Montréal, Montréal, Quebec, Canada.
Ophthalmol Sci. 2023 May 5;3(4):100324. doi: 10.1016/j.xops.2023.100324. eCollection 2023 Dec.
PURPOSE: Foundation models are a novel type of artificial intelligence algorithm, in which models are pretrained at scale on unannotated data and fine-tuned for a myriad of downstream tasks, such as generating text. This study assessed the accuracy of ChatGPT, a large language model (LLM), in the ophthalmology question-answering space.
DESIGN: Evaluation of diagnostic test or technology.
SUBJECTS: ChatGPT is a publicly available LLM.
METHODS: We tested 2 versions of ChatGPT (January 9 "legacy" and ChatGPT Plus) on 2 popular multiple-choice question banks commonly used to prepare for the high-stakes Ophthalmic Knowledge Assessment Program (OKAP) examination. We generated two 260-question simulated exams from the Basic and Clinical Science Course (BCSC) Self-Assessment Program and the OphthoQuestions online question bank. We carried out logistic regression to determine the effect of the examination section, cognitive level, and difficulty index on answer accuracy. We also performed a post hoc analysis using Tukey's test to determine whether there were meaningful differences between the tested subspecialties.
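To make the analysis concrete, below is a minimal sketch in Python (statsmodels) of a logistic regression with a likelihood-ratio chi-square test and a Tukey post hoc comparison. The data, column names, and model formula are illustrative assumptions, not the authors' actual code or dataset.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from scipy import stats
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    # Synthetic stand-in for one graded simulated exam: one row per question.
    rng = np.random.default_rng(0)
    n = 260  # size of one simulated OKAP exam
    df = pd.DataFrame({
        "section": rng.choice(["retina", "glaucoma", "neuro", "pathology"], n),
        "cognitive_level": rng.choice(["recall", "application"], n),
        "difficulty": rng.uniform(0.2, 0.9, n),  # item difficulty index
    })
    # Easier questions (higher difficulty index) are answered correctly
    # more often in this synthetic outcome.
    df["correct"] = rng.binomial(1, 0.2 + 0.6 * df["difficulty"])

    # Logistic regression of answer accuracy on examination section,
    # cognitive level, and difficulty index.
    full = smf.logit("correct ~ C(section) + C(cognitive_level) + difficulty",
                     data=df).fit(disp=0)

    # Likelihood-ratio (LR) chi-square for one predictor: refit without it
    # and compare log-likelihoods of the full and reduced models.
    reduced = smf.logit("correct ~ C(cognitive_level) + difficulty",
                        data=df).fit(disp=0)
    lr = 2 * (full.llf - reduced.llf)
    p = stats.chi2.sf(lr, full.df_model - reduced.df_model)
    print(f"section: LR = {lr:.2f}, P = {p:.3f}")

    # Post hoc pairwise comparisons between sections (Tukey's test on
    # per-question correctness; the paper's exact procedure may differ).
    print(pairwise_tukeyhsd(df["correct"], df["section"]).summary())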
MAIN OUTCOME MEASURES: We reported the accuracy of ChatGPT for each examination section in percentage correct by comparing ChatGPT's outputs with the answer key provided by the question banks. We presented logistic regression results with a likelihood ratio (LR) chi-square. We considered differences between examination sections statistically significant at a P value of < 0.05.
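Continuing the sketch above, percentage correct per examination section reduces to a grouped mean over the graded answers (again an illustrative assumption, not the authors' code):

    # Percentage correct per examination section.
    per_section = df.groupby("section")["correct"].mean().mul(100).round(1)
    print(per_section)

    # Overall accuracy against the answer key.
    print(f"overall: {df['correct'].mean() * 100:.1f}%")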
RESULTS: The legacy model achieved 55.8% accuracy on the BCSC set and 42.7% on the OphthoQuestions set. With ChatGPT Plus, accuracy increased to 59.4% ± 0.6% and 49.2% ± 1.0%, respectively. Accuracy improved with easier questions when controlling for the examination section and cognitive level. Logistic regression analysis of the legacy model showed that the examination section (LR, 27.57; P = 0.006) followed by question difficulty (LR, 24.05; P < 0.001) were most predictive of ChatGPT's answer accuracy. Although the legacy model performed best in general medicine and worst in neuro-ophthalmology (P < 0.001) and ocular pathology (P = 0.029), similar post hoc findings were not seen with ChatGPT Plus, suggesting more consistent results across examination sections.
CONCLUSIONS: ChatGPT showed encouraging performance on a simulated OKAP examination. Specializing LLMs through domain-specific pretraining may be necessary to improve their performance in ophthalmic subspecialties.
FINANCIAL DISCLOSURE(S): Proprietary or commercial disclosure may be found after the references.