Edward S. Harkness Eye Institute, Department of Ophthalmology, Columbia University Irving Medical Center, New York, NY 10032, USA; Shiley Eye Institute and Viterbi Family Department of Ophthalmology, University of California, San Diego, CA 92093, USA.
Shiley Eye Institute and Viterbi Family Department of Ophthalmology, University of California, San Diego, CA 92093, USA.
Asia Pac J Ophthalmol (Phila). 2024 Sep-Oct;13(5):100106. doi: 10.1016/j.apjo.2024.100106. Epub 2024 Oct 5.
To evaluate the accuracy of large language models (LLMs) in answering ophthalmology board-style questions.
Meta-analysis.
A literature search was conducted using PubMed and Embase in March 2024. We included full-length articles and research letters published in English that reported the accuracy of LLMs in answering ophthalmology board-style questions. Data on LLM performance, including the number of questions submitted and the number of correct responses generated, were extracted for each question set in each study. Pooled accuracy was calculated using a random-effects model. Subgroup analyses were performed based on the LLMs used and the specific ophthalmology topics assessed.
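The abstract does not specify the exact random-effects estimator or the scale on which accuracies were pooled; the following Python sketch illustrates one common approach, DerSimonian-Laird pooling of logit-transformed accuracies, using hypothetical counts of correct responses and questions submitted per question set.

import numpy as np
from scipy.special import expit, logit

# Hypothetical question sets: (correct responses, questions submitted).
data = [(130, 200), (95, 150), (160, 250), (70, 120)]

# Logit-transform each accuracy so the pooled estimate stays within (0, 1).
k = np.array([c for c, n in data], dtype=float)
n = np.array([m for c, m in data], dtype=float)
y = logit(k / n)                 # per-question-set effect on the logit scale
v = 1.0 / k + 1.0 / (n - k)      # approximate within-study variance of logit(p)

# DerSimonian-Laird estimate of between-study variance tau^2.
w = 1.0 / v
y_fixed = np.sum(w * y) / np.sum(w)
Q = np.sum(w * (y - y_fixed) ** 2)
C = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (Q - (len(y) - 1)) / C)

# Random-effects weights, pooled logit accuracy, and back-transformed 95% CI.
w_re = 1.0 / (v + tau2)
y_pooled = np.sum(w_re * y) / np.sum(w_re)
se = np.sqrt(1.0 / np.sum(w_re))
lo, hi = y_pooled - 1.96 * se, y_pooled + 1.96 * se
print(f"Pooled accuracy: {expit(y_pooled):.2f} "
      f"(95% CI: {expit(lo):.2f}-{expit(hi):.2f})")

In practice, published meta-analyses of proportions usually rely on dedicated packages (for example, metafor in R); the sketch above only illustrates the pooling logic, not the authors' actual analysis pipeline.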
Among the 14 studies retrieved, 13 (93%) tested LLMs on multiple ophthalmology topics. ChatGPT-3.5, ChatGPT-4, Bard, and Bing Chat were assessed in 12 (86%), 11 (79%), 4 (29%), and 4 (29%) studies, respectively. The overall pooled accuracy of LLMs was 0.65 (95% CI: 0.61-0.69). Among the different LLMs, ChatGPT-4 achieved the highest pooled accuracy at 0.74 (95% CI: 0.73-0.79), while ChatGPT-3.5 recorded the lowest at 0.52 (95% CI: 0.51-0.54). LLMs performed best in "pathology" (0.78 [95% CI: 0.70-0.86]) and worst in "fundamentals and principles of ophthalmology" (0.52 [95% CI: 0.48-0.56]).
The overall accuracy of LLMs in answering ophthalmology board-style questions was acceptable but not exceptional, with ChatGPT-4 and Bing Chat being the top-performing models. Performance varied significantly across the specific ophthalmology topics tested. This inconsistency is a concern and highlights the need for future studies to include ophthalmology board-style questions with images to examine the competency of LLMs more comprehensively.