Performance of GPT-4 Vision on kidney pathology exam questions.

Affiliations

Division of Nephrology and Hypertension, Department of Medicine, Mayo Clinic, Rochester, MN, US.

Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, US.

Publication information

Am J Clin Pathol. 2024 Sep 3;162(3):220-226. doi: 10.1093/ajcp/aqae030.

DOI:10.1093/ajcp/aqae030

PMID:38567909

Abstract

OBJECTIVES

ChatGPT (OpenAI, San Francisco, CA) has shown impressive results across various medical examinations, but its performance in kidney pathology is not yet established. This study evaluated the proficiency of GPT-4 Vision (GPT-4V), an updated version of the platform that can analyze images, on kidney pathology questions and compared its responses with those of nephrology trainees.

METHODS

Thirty-nine questions (19 text-based questions and 20 with various kidney biopsy images) designed specifically for the training of nephrology fellows were employed.

RESULTS

GPT-4V displayed comparable accuracy rates in the first and second runs (67% and 72%, respectively, P = .50). The aggregated accuracy of GPT-4V, however, and particularly its consistent accuracy, was lower than that of trainees (72% and 67% vs 79%). Both GPT-4V and trainees displayed comparable accuracy in responding to image-based and text-only questions (55% vs 79% and 81% vs 78%, P = .11 and P = .67, respectively). The consistent accuracy in image-based, directly asked questions for GPT-4V was 29%, much lower than its 88% consistency on text-only, directly asked questions (P = .02). In contrast, trainees maintained similar accuracy in directly asked image-based and text-based questions (80% vs 77%, P = .65). Although the aggregated accuracy for correctly interpreting images was 69%, the consistent accuracy across both runs was only 39%. The accuracy of GPT-4V in answering questions with correct image interpretation was significantly higher than for questions with incorrect image interpretation (100% vs 0% and 100% vs 33% for the first and second runs, P = .001 and P = .02, respectively).
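
The abstract does not define "aggregated" and "consistent" accuracy. As a minimal sketch, assuming aggregated accuracy means correct answers pooled over both runs and consistent accuracy means the share of questions answered correctly in both runs, the two measures could be computed as follows. The function names and the toy data are hypothetical and are not taken from the study.

```python
from typing import Sequence


def aggregated_accuracy(run1: Sequence[bool], run2: Sequence[bool]) -> float:
    """Assumed definition: correct answers pooled over both runs."""
    return (sum(run1) + sum(run2)) / (len(run1) + len(run2))


def consistent_accuracy(run1: Sequence[bool], run2: Sequence[bool]) -> float:
    """Assumed definition: share of questions answered correctly in BOTH runs."""
    if len(run1) != len(run2):
        raise ValueError("both runs must cover the same questions")
    both_correct = sum(a and b for a, b in zip(run1, run2))
    return both_correct / len(run1)


# Toy data (not study data): per-question correctness for four questions.
run1 = [True, True, False, True]
run2 = [True, False, True, True]

agg = aggregated_accuracy(run1, run2)   # 6 correct out of 8 answers
cons = consistent_accuracy(run1, run2)  # 2 of 4 questions correct in both runs
print(f"aggregated={agg:.2f}, consistent={cons:.2f}")
```

Under these assumed definitions, consistent accuracy can never exceed the accuracy of either single run, which would fit the pattern of lower "consistent" figures reported above.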

CONCLUSIONS

The performance of GPT-4V in handling kidney pathology questions, especially those including images, is limited. There is a notable need for enhancement in GPT-4V proficiency in interpreting images.
