Department of Orthopaedic Surgery, The University of Chicago, Chicago, IL, USA.
Clin Orthop Relat Res. 2024 Dec 1;482(12):2098-2106. doi: 10.1097/CORR.0000000000003234. Epub 2024 Sep 6.
BACKGROUND: Artificial intelligence (AI) is engineered to emulate tasks that have historically required human interaction and intellect, including learning, pattern recognition, decision-making, and problem-solving. Although AI models like ChatGPT-4 have demonstrated satisfactory performance on medical licensing exams, suggesting a potential for supporting medical diagnostics and decision-making, no study of which we are aware has evaluated the ability of these tools to make treatment recommendations when given clinical vignettes and representative medical imaging of common orthopaedic conditions. As AI continues to advance, a thorough understanding of its strengths and limitations is necessary to inform safe and helpful integration into medical practice.
QUESTIONS/PURPOSES: (1) What is the concordance of ChatGPT-4-generated treatment recommendations for common orthopaedic conditions with both the American Academy of Orthopaedic Surgeons (AAOS) clinical practice guidelines (CPGs) and an orthopaedic attending physician's treatment plan? (2) In what specific areas do the ChatGPT-4-generated treatment recommendations diverge from the AAOS CPGs?
METHODS: Ten common orthopaedic conditions with associated AAOS CPGs were identified: carpal tunnel syndrome, distal radius fracture, glenohumeral joint osteoarthritis, rotator cuff injury, clavicle fracture, hip fracture, hip osteoarthritis, knee osteoarthritis, ACL injury, and acute Achilles rupture. For each condition, the medical records of 10 deidentified patients managed at our facility were used to construct clinical vignettes, each of which presented a single, clearly stated diagnosis. The vignettes also encompassed a range of diagnostic severity to more thoroughly evaluate adherence to the treatment guidelines outlined by the AAOS. These clinical vignettes were presented alongside representative radiographic imaging, and the model was prompted to provide a single treatment plan recommendation. Each treatment plan was compared with established AAOS CPGs and with the treatment plan documented by the attending orthopaedic surgeon who treated the specific patient. Vignettes in which ChatGPT-4 recommendations diverged from CPGs were reviewed to identify patterns of error, which were then summarized.
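The vignette-plus-radiograph prompting workflow described above can be sketched as follows. This is a minimal illustration only: the prompt wording, the model identifier (`gpt-4-turbo`), and the helper name `build_request` are assumptions, since the study does not publish its exact protocol. The message layout follows the OpenAI Chat Completions multimodal format (one text part plus one base64-encoded image part).

```python
import base64

def build_request(vignette_text: str, image_bytes: bytes,
                  model: str = "gpt-4-turbo") -> dict:
    """Assemble a multimodal chat request: clinical vignette text plus a
    base64-encoded radiograph, asking for a single treatment recommendation.
    Prompt wording and model name are illustrative assumptions."""
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": vignette_text
                        + "\n\nProvide a single treatment plan recommendation.",
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            }
        ],
    }

# Example payload for one (hypothetical) vignette; the image bytes here are a stub.
request = build_request(
    "Vignette: patient with an isolated, clearly stated orthopaedic diagnosis ...",
    b"\x89PNG-stub-bytes",
)
```

With the `openai` Python SDK, such a payload would be sent via `client.chat.completions.create(**request)`; that call requires an API key and network access, so it is omitted here.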
RESULTS: ChatGPT-4 provided treatment recommendations in accordance with the AAOS CPGs in 90% (90 of 100) of clinical vignettes. Concordance between ChatGPT-4-generated plans and the plan recommended by the treating orthopaedic attending physician was 78% (78 of 100). All fracture vignette recommendations (30 of 30) and all hip and knee osteoarthritis vignette recommendations matched CPG recommendations, whereas the model struggled most with carpal tunnel syndrome (3 of 10 recommendations were discordant). ChatGPT-4 recommendations diverged from AAOS CPGs for three carpal tunnel syndrome vignettes; two vignettes each for ACL injury, rotator cuff injury, and glenohumeral joint osteoarthritis; and one acute Achilles rupture vignette. In these situations, ChatGPT-4 most often struggled to correctly interpret injury severity and progression, incorporate patient factors (such as lifestyle or comorbidities) into decision-making, and recognize a contraindication to surgery.
CONCLUSION: ChatGPT-4 can generate accurate treatment plans aligned with CPGs, but it can also make mistakes when required to integrate multiple patient factors into decision-making or to interpret disease severity and progression. Physicians must critically assess the full clinical picture when using AI tools to support their decision-making.
CLINICAL RELEVANCE: ChatGPT-4 may be used as an on-demand diagnostic companion, but patient-centered decision-making should remain in the hands of the physician.