Exploring the Performance of ChatGPT in an Orthopaedic Setting and Its Potential Use as an Educational Tool.

Author Information

Drouaud Arthur, Stocchi Carolina, Tang Justin, Gonsalves Grant, Cheung Zoe, Szatkowski Jan, Forsh David

Affiliations

George Washington University School of Medicine, Washington, District of Columbia.

Department of Orthopaedic Surgery, Mount Sinai, New York, New York.

Publication Information

JB JS Open Access. 2024 Nov 26;9(4). doi: 10.2106/JBJS.OA.24.00081. eCollection 2024 Oct-Dec.

Abstract

INTRODUCTION

We assessed ChatGPT-4 Vision (GPT-4V)'s performance in image interpretation, diagnosis formulation, and patient management. We aimed to shed light on its potential as an educational tool for medical students working through real-life cases.

METHODS

Ten of the most popular orthopaedic trauma cases from OrthoBullets were selected. GPT-4V interpreted medical imaging and patient information, provided diagnoses, and guided responses to OrthoBullets questions. Four fellowship-trained orthopaedic trauma surgeons rated GPT-4V's responses on a 5-point Likert scale (strongly disagree to strongly agree). Each answer was assessed for alignment with current medical knowledge (accuracy), whether its reasoning was logical (rationale), relevance to the specific case (relevance), and whether surgeons would trust the answer (trustworthiness). Mean scores were calculated from the surgeons' ratings.
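The abstract does not include the authors' scoring code. As a minimal sketch of the aggregation described above, assuming each of the four surgeons assigns a 1-to-5 Likert rating on each of the four dimensions per answer, the following Python snippet computes per-dimension means and an overall mean; the data layout and rating values are illustrative assumptions, not the study's data.

```python
from statistics import mean

# The four evaluation dimensions used by the surgeon raters.
DIMENSIONS = ("accuracy", "rationale", "relevance", "trustworthiness")

# Hypothetical layout: for one answer, each dimension maps to the four
# surgeons' 1-5 Likert ratings (values are made up for illustration).
ratings = {
    "accuracy":        [3, 4, 3, 3],
    "rationale":       [4, 4, 3, 4],
    "relevance":       [4, 4, 4, 3],
    "trustworthiness": [3, 3, 3, 4],
}

# Mean score per dimension, then an overall score taken as the
# unweighted mean of the four dimension means.
dimension_means = {dim: mean(ratings[dim]) for dim in DIMENSIONS}
overall = mean(dimension_means.values())

for dim, score in dimension_means.items():
    print(f"{dim}: {score:.2f}")
print(f"overall: {overall:.2f}")
```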

RESULTS

In total, 10 clinical cases comprising 97 questions were analyzed (10 imaging, 35 management, and 52 treatment). The surgeons assigned a mean overall rating of 3.46/5.00 to GPT-4V's imaging responses (accuracy 3.28, rationale 3.68, relevance 3.75, and trustworthiness 3.15). Management questions received an overall score of 3.76 (accuracy 3.61, rationale 3.84, relevance 4.01, and trustworthiness 3.58), while treatment questions had a mean overall score of 4.04 (accuracy 3.99, rationale 4.08, relevance 4.15, and trustworthiness 3.93).
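Each reported overall score matches the unweighted mean of its four dimension scores; this relationship is an inference from the figures above, not a formula stated in the abstract. A quick check:

```python
from statistics import mean

# Reported dimension scores: accuracy, rationale, relevance, trustworthiness.
reported = {
    "imaging":    [3.28, 3.68, 3.75, 3.15],  # overall reported as 3.46
    "management": [3.61, 3.84, 4.01, 3.58],  # overall reported as 3.76
    "treatment":  [3.99, 4.08, 4.15, 3.93],  # overall reported as 4.04
}

# Each line prints the reported overall score to two decimal places.
for category, scores in reported.items():
    print(f"{category}: {mean(scores):.2f}")
```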

CONCLUSION

This is the first study to evaluate GPT-4V's imaging interpretation, personalized management, and treatment approaches as a medical educational tool. Surgeon ratings indicate overall fair agreement with GPT-4V's reasoning behind its decision-making. GPT-4V performed less favorably on imaging interpretation than on management and treatment questions. As a standalone tool for medical education, GPT-4V's performance falls below our fellowship-trained orthopaedic trauma surgeons' standards.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/96da/11584220/35c48bfc09f3/jbjsoa-9-e24.00081-g001.jpg
