
Large Language Models' Responses to Spinal Cord Injury: A Comparative Study of Performance.

Author Information

Li Jinze, Chang Chao, Li Yanqiu, Cui Shengyu, Yuan Fan, Li Zhuojun, Wang Xinyu, Li Kang, Feng Yuxin, Wang Zuowei, Wei Zhijian, Jian Fengzeng

Affiliations

Department of Neurosurgery, Xuanwu Hospital, Capital Medical University, No. 45 Changchun Street, Xicheng District, Beijing, 100053, China.

Spine Center, China International Neuroscience Institute (CHINA-INI), Beijing, China.

Publication Information

J Med Syst. 2025 Mar 25;49(1):39. doi: 10.1007/s10916-025-02170-7.

Abstract

With the increasing application of large language models (LLMs) in the medical field, their potential in patient education and clinical decision support is becoming increasingly prominent. Given the complex pathogenesis, diverse treatment options, and lengthy rehabilitation periods of spinal cord injury (SCI), patients are increasingly turning to advanced online resources to obtain relevant medical information. This study analyzed responses from four LLMs-ChatGPT-4o, Claude-3.5 Sonnet, Gemini-1.5 Pro, and Llama-3.1-to 37 SCI-related questions spanning pathogenesis, risk factors, clinical features, diagnostics, treatments, and prognosis. Quality and readability were assessed using the Ensuring Quality Information for Patients (EQIP) tool and Flesch-Kincaid metrics, respectively. Accuracy was independently scored by three senior spine surgeons using consensus scoring. Performance varied among the models. Gemini ranked highest in EQIP scores, suggesting superior information quality. Although the readability of all four LLMs was generally low, requiring college-level reading comprehension, they were all able to effectively simplify complex content. Notably, ChatGPT led in accuracy, achieving significantly higher "Good" ratings (83.8%) compared to Claude (78.4%), Gemini (54.1%), and Llama (62.2%). Comprehensiveness scores were high across all models. Furthermore, the LLMs exhibited strong self-correction abilities. After being prompted for revision, the accuracy of ChatGPT's and Claude's responses improved by 100% and 50%, respectively; both Gemini and Llama improved by 67%. This study represents the first systematic comparison of leading LLMs in the context of SCI. While Gemini excelled in response quality, ChatGPT provided the most accurate and comprehensive responses.

