Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy.
Ophthalmology Department, Fondazione Policlinico Universitario "A. Gemelli", IRCCS, Rome, Italy.
Br J Ophthalmol. 2024 Sep 20;108(10):1457-1469. doi: 10.1136/bjo-2023-325143.
We aimed to assess the capability of three publicly available large language models, Chat Generative Pretrained Transformer (ChatGPT-3.5), ChatGPT-4 and Google Gemini, to analyse retinal detachment cases and suggest the best possible surgical planning.
The records of 54 retinal detachment cases were entered into the ChatGPT and Gemini interfaces. After asking 'Specify what kind of surgical planning you would suggest and the eventual intraocular tamponade.' and collecting the answers, we assessed their level of agreement with the shared opinion of three expert vitreoretinal surgeons. Moreover, ChatGPT and Gemini answers were graded from 1 to 5 (poor to excellent quality) according to the Global Quality Score (GQS).
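The study was run through the models' public web interfaces; as a rough illustration of the same workflow, the Python sketch below submits one anonymised record plus the study prompt through an API client instead. The OpenAI client calls are real, but the case record, model name and batching are illustrative assumptions, not the authors' procedure.

# Minimal sketch of the case-querying workflow, assuming API access
# (the study itself used the public web interfaces). The case record
# and model name below are illustrative placeholders.
from openai import OpenAI

PROMPT = ("Specify what kind of surgical planning you would suggest "
          "and the eventual intraocular tamponade.")

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_model(case_record: str, model: str = "gpt-4") -> str:
    """Send one anonymised retinal detachment record plus the study prompt."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{case_record}\n\n{PROMPT}"}],
    )
    return response.choices[0].message.content

# Example: collect answers for every record in a hypothetical case list.
cases = ["Case 1: pseudophakic eye, macula-off rhegmatogenous detachment ..."]
answers = [ask_model(record) for record in cases]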
After excluding four controversial cases, 50 cases were included. Overall, the surgical choices of ChatGPT-3.5, ChatGPT-4 and Google Gemini agreed with those of the vitreoretinal surgeons in 40/50 (80%), 42/50 (84%) and 35/50 (70%) of cases, respectively. Google Gemini was unable to respond in five cases. Contingency analysis showed a significant difference between ChatGPT-4 and Gemini (p=0.03). Mean GQS values were 3.9±0.8 for ChatGPT-3.5 and 4.2±0.7 for ChatGPT-4, while Gemini scored 3.5±1.1. There was no statistical difference between the two ChatGPT versions (p=0.22), while both outperformed Gemini (p=0.03 and p=0.002, respectively). The main source of error was the choice of endotamponade (14% for ChatGPT-3.5 and ChatGPT-4, and 12% for Google Gemini). Only ChatGPT-4 was able to suggest a combined phacovitrectomy approach.
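As a hedged illustration of the reported statistics, the sketch below runs a chi-squared contingency test on the headline agreement counts (treating Gemini's five non-responses as disagreements, which is an assumption) and a Mann-Whitney U test on made-up per-case GQS grades. The abstract does not give the exact tables or raw scores, so the outputs need not reproduce the published p values.

# Illustrative reconstruction of the statistics, not the authors' exact
# analysis: the contingency table uses the headline agreement counts,
# and the GQS arrays are hypothetical placeholders.
from scipy.stats import chi2_contingency, mannwhitneyu

# Agreement with the expert surgeons (agree, disagree) out of 50 cases.
table = [
    [42, 8],   # ChatGPT-4
    [35, 15],  # Google Gemini (counting the 5 unanswered cases as disagreement)
]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.3f}")

# GQS comparison (grades 1-5 per case); these arrays are made-up
# placeholders standing in for the real per-case grades.
gqs_gpt4 = [4, 5, 4, 3, 5, 4, 4]
gqs_gemini = [3, 4, 2, 5, 3, 3, 4]
stat, p_gqs = mannwhitneyu(gqs_gpt4, gqs_gemini, alternative="two-sided")
print(f"U={stat:.1f}, p={p_gqs:.3f}")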
In conclusion, Google Gemini and ChatGPT evaluated vitreoretinal patient records coherently, showing a good level of agreement with expert surgeons. According to the GQS, ChatGPT's recommendations were more accurate and precise than Gemini's.