
Large Language Models' Responses to Spinal Cord Injury: A Comparative Study of Performance.

Author Information

Li Jinze, Chang Chao, Li Yanqiu, Cui Shengyu, Yuan Fan, Li Zhuojun, Wang Xinyu, Li Kang, Feng Yuxin, Wang Zuowei, Wei Zhijian, Jian Fengzeng

Affiliations

Department of Neurosurgery, Xuanwu Hospital, Capital Medical University, No. 45 Changchun Street, Xicheng District, Beijing, 100053, China.

Spine Center, China International Neuroscience Institute (CHINA-INI), Beijing, China.

Publication Information

J Med Syst. 2025 Mar 25;49(1):39. doi: 10.1007/s10916-025-02170-7.

Abstract

With the increasing application of large language models (LLMs) in the medical field, their potential in patient education and clinical decision support is becoming increasingly prominent. Given the complex pathogenesis, diverse treatment options, and lengthy rehabilitation periods of spinal cord injury (SCI), patients are increasingly turning to advanced online resources to obtain relevant medical information. This study analyzed responses from four LLMs-ChatGPT-4o, Claude-3.5 Sonnet, Gemini-1.5 Pro, and Llama-3.1-to 37 SCI-related questions spanning pathogenesis, risk factors, clinical features, diagnostics, treatments, and prognosis. Quality and readability were assessed using the Ensuring Quality Information for Patients (EQIP) tool and Flesch-Kincaid metrics, respectively. Accuracy was independently scored by three senior spine surgeons using consensus scoring. Performance varied among the models. Gemini ranked highest in EQIP scores, suggesting superior information quality. Although the readability of all four LLMs was generally low, requiring college-level reading comprehension, they were all able to effectively simplify complex content. Notably, ChatGPT led in accuracy, achieving significantly higher "Good" ratings (83.8%) compared to Claude (78.4%), Gemini (54.1%), and Llama (62.2%). Comprehensiveness scores were high across all models. Furthermore, the LLMs exhibited strong self-correction abilities. After being prompted for revision, the accuracy of ChatGPT's and Claude's responses improved by 100% and 50%, respectively; both Gemini and Llama improved by 67%. This study represents the first systematic comparison of leading LLMs in the context of SCI. While Gemini excelled in response quality, ChatGPT provided the most accurate and comprehensive responses.

