Drouaud Arthur, Stocchi Carolina, Tang Justin, Gonsalves Grant, Cheung Zoe, Szatkowski Jan, Forsh David
George Washington University School of Medicine, Washington, District of Columbia.
Department of Orthopaedic Surgery, Mount Sinai, New York, New York.
JBJS Open Access. 2024 Nov 26;9(4). doi: 10.2106/JBJS.OA.24.00081. eCollection 2024 Oct-Dec.
We assessed the performance of ChatGPT-4 with vision (GPT-4V) in image interpretation, diagnosis formulation, and patient management. We aimed to shed light on its potential as an educational tool for medical students working through real-life cases.
Ten of the most popular orthopaedic trauma cases from OrthoBullets were selected. GPT-4V interpreted the medical imaging and patient information, provided diagnoses, and answered the OrthoBullets questions. Four fellowship-trained orthopaedic trauma surgeons rated GPT-4V's responses on a 5-point Likert scale (strongly disagree to strongly agree). Each answer was assessed for alignment with current medical knowledge (accuracy), whether its reasoning was logical (rationale), relevance to the specific case (relevance), and whether surgeons would trust it (trustworthiness). Mean scores across the surgeon ratings were calculated.
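As an illustration of the scoring procedure described above, the following is a minimal sketch, using hypothetical data and field names (not taken from the study), of how the surgeons' Likert ratings could be averaged per question category and rating criterion:

```python
from statistics import mean

# Hypothetical ratings: each entry is one surgeon's 1-5 Likert score for one
# GPT-4V answer, tagged with question category and rating criterion.
# (Illustrative values only -- not data from the study.)
ratings = [
    {"category": "imaging",    "criterion": "accuracy",        "score": 3},
    {"category": "imaging",    "criterion": "accuracy",        "score": 4},
    {"category": "management", "criterion": "trustworthiness", "score": 4},
    {"category": "treatment",  "criterion": "relevance",       "score": 5},
    # ... one row per surgeon x question x criterion
]

def mean_scores(rows):
    """Average Likert scores per (category, criterion) pair."""
    groups = {}
    for r in rows:
        groups.setdefault((r["category"], r["criterion"]), []).append(r["score"])
    return {key: round(mean(vals), 2) for key, vals in groups.items()}

if __name__ == "__main__":
    for (category, criterion), avg in sorted(mean_scores(ratings).items()):
        print(f"{category:<11} {criterion:<16} mean = {avg:.2f}")
```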
In total, 10 clinical cases comprising 97 questions were analyzed (10 imaging, 35 management, and 52 treatment). The surgeons assigned a mean overall rating of 3.46/5.00 to GPT-4V's imaging responses (accuracy 3.28, rationale 3.68, relevance 3.75, and trustworthiness 3.15). Management questions received a mean overall score of 3.76 (accuracy 3.61, rationale 3.84, relevance 4.01, and trustworthiness 3.58), while treatment questions received a mean overall score of 4.04 (accuracy 3.99, rationale 4.08, relevance 4.15, and trustworthiness 3.93).
This is the first study evaluating GPT-4V's imaging interpretation, personalized management, and treatment approaches as a medical educational tool. Surgeon ratings indicate overall fair agreement with the reasoning behind GPT-4V's decision-making. GPT-4V performed less favorably in imaging interpretation than in its management and treatment recommendations. As a standalone tool for medical education, GPT-4V's performance falls below the standards of our fellowship-trained orthopaedic trauma surgeons.