Comparison of Gemini Advanced and ChatGPT 4.0's Performances on the Ophthalmology Resident Ophthalmic Knowledge Assessment Program (OKAP) Examination Review Question Banks.

Author Information

Gill Gurnoor S, Tsai Joby, Moxam Jillene, Sanghvi Harshal A, Gupta Shailesh

Affiliations

Medical School, Florida Atlantic University Charles E. Schmidt College of Medicine, Boca Raton, USA.

Ophthalmology, Broward Health, Fort Lauderdale, USA.

Publication Information

Cureus. 2024 Sep 17;16(9):e69612. doi: 10.7759/cureus.69612. eCollection 2024 Sep.

Abstract

Background With advancements in natural language processing, tools such as Chat Generative Pre-Trained Transformer (ChatGPT) version 4.0 and Google's Gemini Advanced are being increasingly evaluated for their potential in various medical applications. The objective of this study was to systematically assess the performance of these large language models (LLMs) on both image-based and non-image-based questions within the specialized field of ophthalmology. We used a review question bank for the Ophthalmic Knowledge Assessment Program (OKAP), used nationally by ophthalmology residents to prepare for the Ophthalmology Board Exam, to assess the accuracy and performance of ChatGPT and Gemini Advanced.

Methodology A total of 260 randomly generated multiple-choice questions from the OphthoQuestions question bank were run through ChatGPT and Gemini Advanced. A simulated 260-question OKAP examination was created at random from the bank. Question-specific data were analyzed, including overall percent correct, subspecialty accuracy, whether the question was "high yield," difficulty (1-4), and question type (e.g., image, text). To compare the performance of ChatGPT-4.0 and Gemini Advanced across question difficulty, we used the standard deviation of user answer choices to determine question difficulty. Statistical analysis was conducted in Google Sheets using two-tailed t-tests with unequal variance to compare the performance of ChatGPT-4.0 and Google's Gemini Advanced across question types, subspecialties, and difficulty levels.

Results In total, 259 of the 260 questions were included in the study, as one question used a video that no version of ChatGPT could interpret as of May 1, 2024. For text-only questions, ChatGPT-4.0 correctly answered 57.14% (148/259, p < 0.018), and Gemini Advanced correctly answered 46.72% (121/259, p < 0.018). Both models answered most questions without an additional prompt and would have received a below-average score on the OKAP. Moreover, 27 questions required a secondary prompt in ChatGPT-4.0 compared with 67 questions in Gemini Advanced. ChatGPT-4.0 correctly answered 68.99% of easier questions (<2 on a scale of 1-4) and 44.96% of harder questions (>2 on the same scale). In comparison, Gemini Advanced correctly answered 49.61% of easier questions and 44.19% of harder questions. There was a statistically significant difference in accuracy between ChatGPT-4.0 and Gemini Advanced for easy (p < 0.0015) but not for hard (p < 0.55) questions. For image-only questions, ChatGPT-4.0 correctly answered 39.58% (19/48, p < 0.013), and Gemini Advanced correctly answered 33.33% (16/48, p < 0.022), with no statistically significant difference in accuracy between the two models (p < 0.530). A comparison between text-only and image-based questions demonstrated a statistically significant difference in accuracy for both ChatGPT-4.0 (p < 0.013) and Gemini Advanced (p < 0.022).

Conclusions This study provides evidence that ChatGPT-4.0 performs better than Gemini Advanced on OKAP-style examinations in the context of ophthalmic multiple-choice questions. This may indicate greater utility for ChatGPT in ophthalmic medical education. While these models show promise in medical education, caution is warranted, as a more detailed evaluation of their reliability is needed.
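As an illustration of the statistical approach described in the Methodology (two-tailed t-tests with unequal variance, i.e., Welch's t-test, on per-model performance), the sketch below shows how such a comparison could be run. It is not the authors' analysis: the per-question correctness vectors are simulated from the reported accuracies, and SciPy is used here in place of Google Sheets.

# Minimal sketch of the comparison described in the Methodology: a two-tailed
# t-test with unequal variance (Welch's t-test) on per-question correctness.
# NOTE: the data below are simulated from the reported accuracies, not the
# study's actual per-question results, and SciPy stands in for Google Sheets.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-question outcomes: 1 = correct, 0 = incorrect (259 questions).
chatgpt_correct = rng.binomial(1, 148 / 259, size=259)  # ~57.14% reported accuracy
gemini_correct = rng.binomial(1, 121 / 259, size=259)   # ~46.72% reported accuracy

# Welch's two-tailed t-test (unequal variance), as in the reported analysis.
t_stat, p_value = stats.ttest_ind(chatgpt_correct, gemini_correct, equal_var=False)
print(f"t = {t_stat:.3f}, two-tailed p = {p_value:.4f}")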

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9ada/11486483/515c21b7f5ab/cureus-0016-00000069612-i01.jpg
