

Large Language Models' Responses to Spinal Cord Injury: A Comparative Study of Performance.

Author Information

Li Jinze, Chang Chao, Li Yanqiu, Cui Shengyu, Yuan Fan, Li Zhuojun, Wang Xinyu, Li Kang, Feng Yuxin, Wang Zuowei, Wei Zhijian, Jian Fengzeng

Affiliations

Department of Neurosurgery, Xuanwu Hospital, Capital Medical University, No. 45 Changchun Street, Xicheng District, Beijing, 100053, China.

Spine Center, China International Neuroscience Institute (CHINA-INI), Beijing, China.

Publication Information

J Med Syst. 2025 Mar 25;49(1):39. doi: 10.1007/s10916-025-02170-7.

Abstract

With the increasing application of large language models (LLMs) in the medical field, their potential in patient education and clinical decision support is becoming increasingly prominent. Given the complex pathogenesis, diverse treatment options, and lengthy rehabilitation periods of spinal cord injury (SCI), patients are increasingly turning to advanced online resources to obtain relevant medical information. This study analyzed responses from four LLMs (ChatGPT-4o, Claude-3.5 Sonnet, Gemini-1.5 Pro, and Llama-3.1) to 37 SCI-related questions spanning pathogenesis, risk factors, clinical features, diagnostics, treatments, and prognosis. Quality and readability were assessed using the Ensuring Quality Information for Patients (EQIP) tool and Flesch-Kincaid metrics, respectively. Accuracy was scored independently by three senior spine surgeons, with disagreements resolved by consensus. Performance varied among the models. Gemini ranked highest in EQIP scores, suggesting superior information quality. Although the readability of all four LLMs was generally low, requiring college-level reading comprehension, all were able to effectively simplify complex content on request. Notably, ChatGPT led in accuracy, achieving a significantly higher proportion of "Good" ratings (83.8%) than Claude (78.4%), Gemini (54.1%), and Llama (62.2%). Comprehensiveness scores were high across all models. Furthermore, the LLMs exhibited strong self-correction abilities: after being prompted for revision, the accuracy of ChatGPT and Claude's responses improved by 100% and 50%, respectively, while both Gemini and Llama improved by 67%. This study represents the first systematic comparison of leading LLMs in the context of SCI. While Gemini excelled in response quality, ChatGPT provided the most accurate and comprehensive responses.
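For context on the readability assessment mentioned above, the Flesch-Kincaid Grade Level can be computed from three simple counts. The sketch below is illustrative only (the study does not publish its scoring code); the function name and the example counts are hypothetical, and real readability tools additionally handle tokenization and syllable counting.

```python
# Minimal sketch of the Flesch-Kincaid Grade Level formula, assuming the
# word, sentence, and syllable totals have already been counted.
def fk_grade(words: int, sentences: int, syllables: int) -> float:
    """Return the Flesch-Kincaid Grade Level for a text.

    Higher values indicate harder text; scores of roughly 13 and above
    correspond to the college-level reading the study reports.
    """
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical example: 100 words, 5 sentences, 150 syllables.
print(round(fk_grade(100, 5, 150), 2))  # ~9.91, about a 10th-grade level
```

A response with longer sentences and more polysyllabic words (say, 20 words per sentence and 1.8 syllables per word) scores above grade 13, which is the pattern the study observed across all four models.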

