Chen Chih-Hsiung, Hsieh Kuang-Yu, Huang Kuo-En, Lai Hsien-Yun
Department of Critical Care Medicine, Mennonite Christian Hospital, Hualien City, TWN.
Department of Education and Research, Mennonite Christian Hospital, Hualien City, TWN.
Cureus. 2024 Aug 23;16(8):e67641. doi: 10.7759/cureus.67641. eCollection 2024 Aug.
Introduction The latest generation of large language models (LLMs) features multimodal capabilities, allowing them to interpret graphics, images, and videos, all of which are crucial in medical fields. This study investigates the vision capabilities of the next-generation Generative Pre-trained Transformer 4 (GPT-4) and Google's Gemini.
Methods We evaluated GPT-4 and Gemini on questions from the Taiwan Specialist Board Exams in Pulmonary and Critical Care Medicine, using GPT-3.5, a model limited to text processing, as the comparative baseline. Our dataset included 1,100 questions from 2013 to 2023, with 100 questions per year. Of these, 1,059 were pure text and 41 were text with images. Most questions were written in a non-English language; only six were entirely in English.
Results For the annual 100-question exams from 2013 to 2023, GPT-4 scored 66, 69, 51, 64, 72, 64, 66, 64, 63, 68, and 67, respectively. Gemini scored 45, 48, 45, 45, 46, 59, 54, 41, 53, 45, and 45, while GPT-3.5 scored 39, 33, 35, 36, 32, 33, 43, 28, 32, 33, and 36.
Conclusions These results demonstrate that the newer LLMs with vision capabilities significantly outperform the text-only model: both GPT-4 and Gemini scored higher than GPT-3.5 in every exam year. With the passing score set at 60, GPT-4 passed 10 of the 11 exams and approached human performance.
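The per-year scores reported in the Results can be summarized with a few lines of arithmetic. The following is a minimal sketch, not part of the original study: the names `SCORES`, `YEARS`, `PASS_MARK`, and `summarize` are hypothetical, the scores are assumed to be listed in chronological order starting from 2013, and "passing" is assumed to mean a score of at least 60.

```python
from statistics import mean

# Scores copied from the abstract; assumed to run chronologically 2013..2023.
YEARS = list(range(2013, 2024))
SCORES = {
    "GPT-4": [66, 69, 51, 64, 72, 64, 66, 64, 63, 68, 67],
    "Gemini": [45, 48, 45, 45, 46, 59, 54, 41, 53, 45, 45],
    "GPT-3.5": [39, 33, 35, 36, 32, 33, 43, 28, 32, 33, 36],
}
PASS_MARK = 60  # assumption: an exam is passed when score >= 60


def summarize(scores, years=YEARS, pass_mark=PASS_MARK):
    """Return total correct answers, mean score, and pass/fail breakdown."""
    failed_years = [y for y, s in zip(years, scores) if s < pass_mark]
    return {
        "total_correct": sum(scores),
        "mean": round(mean(scores), 1),
        "exams_passed": len(scores) - len(failed_years),
        "failed_years": failed_years,
    }


if __name__ == "__main__":
    for model, scores in SCORES.items():
        print(model, summarize(scores))
```

Under these assumptions, GPT-4 averages about 64.9 correct answers per exam and misses the pass mark only in 2015, while Gemini (about 47.8) and GPT-3.5 (about 34.5) do not reach 60 in any year.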