

Assessing the performance of Microsoft Copilot, GPT-4 and Google Gemini in ophthalmology.

Authors

Silhadi Meziane, Nassrallah Wissam B, Mikhail David, Milad Daniel, Harissi-Dagher Mona

Affiliations

Faculty of Medicine, University of Montreal, Montreal, QC, Canada.

Department of Ophthalmology, Centre Hospitalier de l'Université de Montréal, Montreal, QC, Canada.

Publication

Can J Ophthalmol. 2025 Feb 4. doi: 10.1016/j.jcjo.2025.01.001.

Abstract

OBJECTIVE

To evaluate the performance of large language models (LLMs), specifically Microsoft Copilot, GPT-4 (GPT-4o and GPT-4o mini), and Google Gemini (Gemini and Gemini Advanced), in answering ophthalmological questions and assessing the impact of prompting techniques on their accuracy.

DESIGN

Prospective qualitative study.

PARTICIPANTS

Microsoft Copilot, GPT-4 (GPT-4o and GPT-4o mini), and Google Gemini (Gemini and Gemini Advanced).

METHODS

A total of 300 ophthalmological questions from StatPearls were tested, covering a range of subspecialties and image-based tasks. Each question was evaluated using 2 prompting techniques: zero-shot forced prompting (prompt 1) and combined role-based and zero-shot plan-and-solve+ prompting (prompt 2).
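The two prompting techniques can be illustrated with a minimal sketch. The templates below are assumptions for illustration only: the abstract does not give the study's exact prompt wording, and the sample question is hypothetical.

```python
# Hypothetical templates for the two prompting techniques described above.
# The exact wording used in the study is not reported in the abstract.

QUESTION = "Which corneal layer is primarily affected in keratoconus?"  # sample item

# Prompt 1: zero-shot forced prompting -- no examples are given, and the
# model is forced to commit to exactly one answer option.
prompt_1 = (
    f"{QUESTION}\n"
    "Answer with a single option (A, B, C, or D). You must choose one."
)

# Prompt 2: role-based + zero-shot plan-and-solve+ prompting -- the model is
# assigned an expert role, then asked to devise a plan, execute it step by
# step, and verify intermediate results before committing to an answer.
prompt_2 = (
    "You are an experienced ophthalmologist.\n"
    f"{QUESTION}\n"
    "First, devise a plan to answer the question. Then carry out the plan "
    "step by step, checking intermediate results, and finally state the "
    "single best option (A, B, C, or D)."
)

print(prompt_1)
print(prompt_2)
```

In this setup, only the prompt text changes between conditions; the question bank and the model under test stay fixed, so accuracy differences can be attributed to the prompting technique.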

RESULTS

With zero-shot forced prompting, GPT-4o demonstrated significantly superior overall performance, correctly answering 72.3% of questions and outperforming all other models, including Copilot (53.7%), GPT-4o mini (62.0%), Gemini (54.3%), and Gemini Advanced (62.0%) (p < 0.0001). Both Copilot and GPT-4o showed notable improvements with prompt 2 over prompt 1, elevating Copilot's accuracy from the lowest (53.7%) to the second highest (72.3%) among the evaluated LLMs.

CONCLUSIONS

While newer iterations of LLMs, such as GPT-4o and Gemini Advanced, outperformed their less advanced counterparts (GPT-4o mini and Gemini), this study emphasizes the need for caution in clinical applications of these models. The choice of prompting technique significantly influences performance, highlighting the necessity for further research to refine LLM capabilities, particularly in visual data interpretation, to ensure their safe integration into medical practice.

