Accuracy of large language models in answering ophthalmology board-style questions: A meta-analysis.

Affiliations

Edward S. Harkness Eye Institute, Department of Ophthalmology, Columbia University Irving Medical Center, New York, NY 10032, USA; Shiley Eye Institute and Viterbi Family Department of Ophthalmology, University of California, San Diego, CA 92093, USA.

Shiley Eye Institute and Viterbi Family Department of Ophthalmology, University of California, San Diego, CA 92093, USA.

Publication Information

Asia Pac J Ophthalmol (Phila). 2024 Sep-Oct;13(5):100106. doi: 10.1016/j.apjo.2024.100106. Epub 2024 Oct 5.

Abstract

PURPOSE

To evaluate the accuracy of large language models (LLMs) in answering ophthalmology board-style questions.

DESIGN

Meta-analysis.

METHODS

A literature search was conducted in PubMed and Embase in March 2024. We included full-length articles and research letters published in English that reported the accuracy of LLMs in answering ophthalmology board-style questions. Data on LLM performance, including the number of questions submitted and correct responses generated, were extracted for each question set from individual studies. Pooled accuracy was calculated using a random-effects model (see the sketch below). Subgroup analyses were performed based on the LLMs used and the specific ophthalmology topics assessed.
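The abstract specifies only that pooled accuracy was computed with a random-effects model. The following is a minimal sketch of one common approach to pooling proportions (logit-transformed accuracies combined with a DerSimonian-Laird estimate of between-study variance); it is not the authors' actual analysis, and the study counts are illustrative placeholders, not data extracted from the included studies.

```python
import math

# (correct responses, questions submitted) per question set.
# Placeholder values only -- not data from the included studies.
studies = [(180, 260), (95, 150), (210, 300), (70, 120)]

# Logit-transform each proportion and approximate its within-study variance.
effects, variances = [], []
for correct, total in studies:
    p = correct / total
    effects.append(math.log(p / (1 - p)))
    variances.append(1 / correct + 1 / (total - correct))

# Inverse-variance (fixed-effect) pooled estimate, needed for Cochran's Q.
w = [1 / v for v in variances]
y_fe = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)

# DerSimonian-Laird estimate of the between-study variance tau^2.
q = sum(wi * (yi - y_fe) ** 2 for wi, yi in zip(w, effects))
c = sum(w) - sum(wi**2 for wi in w) / sum(w)
tau2 = max(0.0, (q - (len(studies) - 1)) / c)

# Random-effects pooled logit and its 95% confidence interval.
w_re = [1 / (v + tau2) for v in variances]
y_re = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
se = math.sqrt(1 / sum(w_re))

def inv_logit(x):
    """Back-transform a logit to a proportion."""
    return 1 / (1 + math.exp(-x))

print(f"Pooled accuracy: {inv_logit(y_re):.2f} "
      f"(95% CI: {inv_logit(y_re - 1.96 * se):.2f}-"
      f"{inv_logit(y_re + 1.96 * se):.2f})")
```

Under this scheme, the subgroup estimates (per model or per ophthalmology topic) would follow the same procedure applied to the corresponding subset of question sets.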

RESULTS

Among the 14 studies retrieved, 13 (93%) tested LLMs on multiple ophthalmology topics. ChatGPT-3.5, ChatGPT-4, Bard, and Bing Chat were assessed in 12 (86%), 11 (79%), 4 (29%), and 4 (29%) studies, respectively. The overall pooled accuracy of LLMs was 0.65 (95% CI: 0.61-0.69). Among the different LLMs, ChatGPT-4 achieved the highest pooled accuracy at 0.74 (95% CI: 0.73-0.79), while ChatGPT-3.5 recorded the lowest at 0.52 (95% CI: 0.51-0.54). LLMs performed best in "pathology" (0.78 [95% CI: 0.70-0.86]) and worst in "fundamentals and principles of ophthalmology" (0.52 [95% CI: 0.48-0.56]).

CONCLUSIONS

The overall accuracy of LLMs in answering ophthalmology board-style questions was acceptable but not exceptional, with ChatGPT-4 and Bing Chat as the top-performing models. Performance varied significantly across the specific ophthalmology topics tested. This inconsistency is a concern and highlights the need for future studies to include ophthalmology board-style questions with images, to examine the competency of LLMs more comprehensively.

