Edward S. Harkness Eye Institute, Department of Ophthalmology, Columbia University Irving Medical Center, New York, NY 10032, USA; Shiley Eye Institute and Viterbi Family Department of Ophthalmology, University of California, San Diego, CA 92093, USA.
Shiley Eye Institute and Viterbi Family Department of Ophthalmology, University of California, San Diego, CA 92093, USA.
Asia Pac J Ophthalmol (Phila). 2024 Sep-Oct;13(5):100106. doi: 10.1016/j.apjo.2024.100106. Epub 2024 Oct 5.
To evaluate the accuracy of large language models (LLMs) in answering ophthalmology board-style questions.
Meta-analysis.
A literature search was conducted using PubMed and Embase in March 2024. We included full-length articles and research letters published in English that reported the accuracy of LLMs in answering ophthalmology board-style questions. Data on LLM performance, including the number of questions submitted and the number of correct responses generated, were extracted for each question set in each study. Pooled accuracy was calculated using a random-effects model. Subgroup analyses were performed based on the LLMs used and the specific ophthalmology topics assessed.
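The abstract does not specify the exact random-effects estimator or the scale on which accuracies were pooled; the following Python sketch illustrates one common approach, DerSimonian-Laird pooling of logit-transformed accuracies, using hypothetical counts of correct responses and questions submitted per question set.

import numpy as np
from scipy.special import expit, logit

# Hypothetical question sets: (correct responses, questions submitted).
data = [(130, 200), (95, 150), (160, 250), (70, 120)]

# Logit-transform each accuracy so the pooled estimate stays within (0, 1).
k = np.array([c for c, n in data], dtype=float)
n = np.array([m for c, m in data], dtype=float)
y = logit(k / n)                 # per-question-set effect on the logit scale
v = 1.0 / k + 1.0 / (n - k)      # approximate within-study variance of logit(p)

# DerSimonian-Laird estimate of between-study variance tau^2.
w = 1.0 / v
y_fixed = np.sum(w * y) / np.sum(w)
Q = np.sum(w * (y - y_fixed) ** 2)
C = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (Q - (len(y) - 1)) / C)

# Random-effects weights, pooled logit accuracy, and back-transformed 95% CI.
w_re = 1.0 / (v + tau2)
y_pooled = np.sum(w_re * y) / np.sum(w_re)
se = np.sqrt(1.0 / np.sum(w_re))
lo, hi = y_pooled - 1.96 * se, y_pooled + 1.96 * se
print(f"Pooled accuracy: {expit(y_pooled):.2f} "
      f"(95% CI: {expit(lo):.2f}-{expit(hi):.2f})")

In practice, published meta-analyses of proportions usually rely on dedicated packages (for example, metafor in R); the sketch above only illustrates the pooling logic, not the authors' actual analysis pipeline.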
Among the 14 studies retrieved, 13 (93%) tested LLMs on multiple ophthalmology topics. ChatGPT-3.5, ChatGPT-4, Bard, and Bing Chat were assessed in 12 (86%), 11 (79%), 4 (29%), and 4 (29%) studies, respectively. The overall pooled accuracy of LLMs was 0.65 (95% CI: 0.61-0.69). Among the different LLMs, ChatGPT-4 achieved the highest pooled accuracy at 0.74 (95% CI: 0.73-0.79), while ChatGPT-3.5 recorded the lowest at 0.52 (95% CI: 0.51-0.54). LLMs performed best in "pathology" (0.78 [95% CI: 0.70-0.86]) and worst in "fundamentals and principles of ophthalmology" (0.52 [95% CI: 0.48-0.56]).
The overall accuracy of LLMs in answering ophthalmology board-style questions was acceptable but not exceptional, with ChatGPT-4 and Bing Chat being the top-performing models. Performance varied significantly across the specific ophthalmology topics tested. This inconsistency is a concern and highlights the need for future studies to include ophthalmology board-style questions with images to examine the competency of LLMs more comprehensively.