Quinn Matthew, Milner John D, Schmitt Phillip, Morrissey Patrick, Lemme Nicholas, Marcaccio Stephen, DeFroda Steven, Tabaddor Ramin, Owens Brett D
Department of Orthopaedics, The Warren Alpert Medical School of Brown University, Providence, Rhode Island, U.S.A.
Arthroscopy. 2025 Jun;41(6):2002-2008. doi: 10.1016/j.arthro.2024.09.020. Epub 2024 Sep 21.
To assess the ability of ChatGPT-4 and Gemini to generate accurate and relevant responses to the 2022 American Academy of Orthopaedic Surgeons (AAOS) Clinical Practice Guidelines (CPG) for anterior cruciate ligament reconstruction (ACLR).
Responses from ChatGPT-4 and Gemini to prompts derived from all 15 AAOS guidelines were evaluated by 7 fellowship-trained orthopaedic sports medicine surgeons using a structured questionnaire assessing 5 key characteristics on a scale from 1 to 5. The prompts were categorized into 3 areas: diagnosis and preoperative management, surgical timing and technique, and rehabilitation and prevention. Statistical analysis included mean scoring, standard deviation, and 2-sided t tests to compare the performance between the 2 large language models (LLMs). Scores were then evaluated for inter-rater reliability (IRR).
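The abstract does not report which IRR statistic was used; as a minimal sketch only, assuming hypothetical placeholder ratings and an intraclass correlation ICC(2,1) as one plausible IRR metric, the mean/SD, two-sided t test, and IRR computations could look like the following (not the authors' code).

```python
# Minimal sketch: mean/SD, a two-sided t test, and ICC(2,1) for 1-5 ratings.
# All data below are hypothetical placeholders, not study data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
chatgpt = rng.integers(4, 6, size=(7, 15)).astype(float)  # 7 raters x 15 prompts
gemini = rng.integers(4, 6, size=(7, 15)).astype(float)

# Mean scoring and standard deviation per model
print(chatgpt.mean(), chatgpt.std(ddof=1))
print(gemini.mean(), gemini.std(ddof=1))

# Two-sided t test comparing the two LLMs on pooled ratings
t, p = stats.ttest_ind(chatgpt.ravel(), gemini.ravel(), alternative="two-sided")
print(f"t = {t:.3f}, p = {p:.3f}")

def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    ratings: (n_subjects, n_raters) array."""
    n, k = ratings.shape
    grand = ratings.mean()
    ms_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    ms_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum() / (k - 1)
    resid = (ratings - ratings.mean(axis=1, keepdims=True)
             - ratings.mean(axis=0, keepdims=True) + grand)
    ms_err = (resid ** 2).sum() / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Inter-rater reliability across the 7 raters (prompts as subjects)
print(f"ICC(2,1) = {icc2_1(chatgpt.T):.2f}")
```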
Overall, both LLMs performed well, with mean scores >4 for all 5 key characteristics. Gemini demonstrated superior performance in overall clarity (4.848 ± 0.36 vs 4.743 ± 0.481, P = .034), whereas all other characteristics showed nonsignificant differences (P > .05). Gemini also demonstrated superior clarity in the surgical timing and technique (P = .038) and the prevention and rehabilitation (P = .044) subcategories. Additionally, Gemini had superior completeness scores in the rehabilitation and prevention subcategory (P = .044), but no statistically significant differences were found among the other subcategories. The overall IRR was 0.71 (moderate).
Both Gemini and ChatGPT-4 demonstrated an overall good ability to generate accurate and relevant responses to question prompts based on the 2022 AAOS CPG for ACLR. However, Gemini demonstrated superior clarity in multiple domains, as well as superior completeness for questions pertaining to rehabilitation and prevention.
This study addresses a gap in the LLM and ACLR literature by comparing the performance of ChatGPT-4 with that of Gemini, which is growing in popularity, with more than 300 million individual uses in May 2024 alone. Moreover, the results demonstrated superior performance of Gemini in both clarity and completeness, which are critical elements of a tool used by patients for educational purposes. Additionally, this study uses question prompts based on the AAOS CPG, which may serve as a method of standardization for future investigations of LLM platform performance. Thus, the results of this study may be of interest to both the readership of Arthroscopy and patients.