Hosseini-Monfared Pooya, Amiri Shayan, Mirahmadi Alireza, Shahbazi Amirhossein, Alamian Aliasghar, Azizi Mohammad, Kazemi Seyed Morteza
Bone Joint and Related Tissues Research Center, Shahid Beheshti University of Medical Sciences, Tehran, Iran.
Bone and Joint Reconstruction Research Center, Department of Orthopedics, School of Medicine, Iran University of Medical Sciences, Tehran, Iran.
Arch Acad Emerg Med. 2025 Apr 5;13(1):e42. doi: 10.22037/aaemj.v13i1.2580. eCollection 2025.
INTRODUCTION: ChatGPT, a general-purpose language model, is not specifically optimized for medical applications. This study aimed to assess the performance of ChatGPT-4 and ChatGPT o1-preview in generating differential diagnoses for common presentations of ankle pain in emergency settings.

METHODS: Common presentations of ankle pain were identified through consultation with an experienced orthopedic surgeon and a review of relevant hospital and social media sources. To replicate typical patient inquiries, questions were written in simple, non-technical language, each requesting three possible differential diagnoses for the scenario described. In a second phase, case vignettes were designed to reflect scenarios typically encountered by triage nurses or physicians. ChatGPT's responses were evaluated against a benchmark established by two experienced orthopedic surgeons, using a scoring system that rated the accuracy, clarity, and relevance of the differential diagnoses according to the order in which they were listed.

RESULTS: Across 21 ankle pain presentations, ChatGPT o1-preview outperformed ChatGPT-4 in both accuracy and clarity, although only the difference in clarity scores reached statistical significance (p < 0.001). ChatGPT o1-preview also achieved a significantly higher total score (p = 0.004). Across 15 case vignettes, ChatGPT o1-preview scored better on diagnostic and management clarity, though the differences in diagnostic accuracy were not statistically significant. Of 51 questions overall, ChatGPT-4 and ChatGPT o1-preview produced incorrect responses to 5 (9.8%) and 4 (7.8%), respectively. Inter-rater reliability analysis showed excellent agreement, with intraclass correlation coefficients of 0.99 (95% CI, 0.998-0.999) for accuracy scores and 0.99 (95% CI, 0.990-0.995) for clarity scores.

CONCLUSION: Both ChatGPT-4 and ChatGPT o1-preview showed acceptable performance in the triage of ankle pain cases in emergency settings, with ChatGPT o1-preview providing clearer and more precise responses. While both models show potential as supportive tools, their role should remain supervised and strictly supplementary to clinical expertise.
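For readers who want to reproduce the kind of inter-rater reliability check reported above, the sketch below computes intraclass correlation coefficients for two raters in Python using the pingouin library. It is a minimal illustration on synthetic data: the 1-5 rating scale, the rater labels, and the choice of ICC model are assumptions, since the abstract does not specify them.

```python
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(0)

# Synthetic scores for 51 questions: rater B's scores closely track
# rater A's, mimicking the near-perfect agreement reported (ICC ~ 0.99).
# The 1-5 scale is an assumption; the abstract does not give the scale.
truth = rng.integers(1, 6, size=51).astype(float)
rater_a = truth + rng.normal(0, 0.1, size=51)
rater_b = truth + rng.normal(0, 0.1, size=51)

# Long format: one row per (question, rater) pair, as pingouin expects.
scores = pd.DataFrame({
    "question": np.tile(np.arange(51), 2),
    "rater": np.repeat(["surgeon_A", "surgeon_B"], 51),
    "accuracy": np.concatenate([rater_a, rater_b]),
})

# pingouin reports all six common ICC variants (ICC1..ICC3k) with 95% CIs;
# the study does not state which variant was used.
icc = pg.intraclass_corr(data=scores, targets="question",
                         raters="rater", ratings="accuracy")
print(icc[["Type", "ICC", "CI95%"]])
```

The same call, run once per outcome (accuracy scores, clarity scores), would yield the two ICC estimates with confidence intervals in the form reported in the RESULTS section.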