From the Bascom Palmer Eye Institute, Miami, Florida, USA (L.Z.C., A.S., J.S.Y., N.Y., C.A.).
Am J Ophthalmol. 2023 Oct;254:141-149. doi: 10.1016/j.ajo.2023.05.024. Epub 2023 Jun 18.
To investigate the ability of generative artificial intelligence models to answer ophthalmology board-style questions.
Experimental study.
This study evaluated 3 large language models (LLMs) with chat interfaces, Bing Chat (Microsoft), ChatGPT-3.5, and ChatGPT-4.0 (OpenAI), using 250 questions from the Basic Science and Clinical Science Self-Assessment Program. Whereas ChatGPT is trained on information last updated in 2021, Bing Chat incorporates a more recently indexed internet search to generate its answers. Performance was compared with that of human respondents. Questions were categorized by complexity and patient care phase, and instances of information fabrication or nonlogical reasoning were documented.
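As a minimal sketch of how board-style multiple-choice responses from a chat model might be scored against an answer key (this is an illustrative assumption, not the authors' actual pipeline; the question identifiers, model outputs, and answer-extraction rule below are hypothetical):

```python
# Sketch only: scoring free-text chat replies to multiple-choice questions.
# Assumes the model's chosen option appears as a standalone letter A-E.
import re

def extract_choice(response_text: str) -> str | None:
    """Pull the first standalone answer letter (A-E) from a free-text reply."""
    match = re.search(r"\b([A-E])\b", response_text)
    return match.group(1) if match else None

def score_responses(responses: dict[str, str], answer_key: dict[str, str]) -> float:
    """Fraction of questions answered correctly."""
    correct = sum(
        extract_choice(text) == answer_key[qid]
        for qid, text in responses.items()
    )
    return correct / len(answer_key)

# Made-up example data: two questions, one answered correctly.
answer_key = {"q1": "B", "q2": "D"}
responses = {
    "q1": "The best answer is B because the lesion is choroidal.",
    "q2": "I would choose A.",
}
print(f"Accuracy: {score_responses(responses, answer_key):.1%}")  # 50.0%
```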
The primary outcome was response accuracy. Secondary outcomes were performance across question subcategories and the frequency of hallucinations.
Human respondents had an average accuracy of 72.2%. ChatGPT-3.5 scored the lowest (58.8%), whereas ChatGPT-4.0 (71.6%) and Bing Chat (71.2%) performed comparably. ChatGPT-4.0 excelled at workup-type questions (odds ratio [OR], 3.89, 95% CI, 1.19-14.73, P = .03) compared with diagnostic questions, but struggled with image interpretation (OR, 0.14, 95% CI, 0.05-0.33, P < .01) when compared with single-step reasoning questions. Relative to single-step questions, Bing Chat also had difficulty with image interpretation (OR, 0.18, 95% CI, 0.08-0.44, P < .01) and multi-step reasoning (OR, 0.30, 95% CI, 0.11-0.84, P = .02). ChatGPT-3.5 had the highest rate of hallucinations and nonlogical reasoning (42.4%), followed by Bing Chat (25.6%) and ChatGPT-4.0 (18.0%).
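For readers unfamiliar with how an odds ratio and its 95% CI relate to counts of correct and incorrect answers in two question categories, the sketch below shows the standard Wald calculation from a 2x2 table. The counts are made up for illustration; the study's exact statistical model may differ (e.g., a regression adjusting for other factors).

```python
# Sketch: odds ratio with a Wald 95% CI from a 2x2 table of
# correct/incorrect answers in two question categories. Counts are hypothetical.
import math

def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    """OR and 95% CI for the table [[a, b], [c, d]]:
    rows = question category, columns = correct vs. incorrect."""
    or_ = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_) - z * se_log_or)
    upper = math.exp(math.log(or_) + z * se_log_or)
    return or_, lower, upper

# Hypothetical counts: 18/20 correct in one category vs. 30/43 in another.
print(odds_ratio_ci(18, 2, 30, 13))  # OR ~3.9 with a wide CI
```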
LLMs (particularly ChatGPT-4.0 and Bing Chat) can perform comparably to human respondents when answering questions from the Basic Science and Clinical Science Self-Assessment Program. The frequency of hallucinations and nonlogical reasoning suggests room for improvement in the performance of conversational agents in the medical domain.