Benchmarking Large Language Models for Cervical Spondylosis

Authors

Zhang Boyan, Du Yueqi, Duan Wanru, Chen Zan

Affiliations

Xuanwu Hospital, Capital Medical University, Beijing, China.

Lab of Spinal Cord Injury and Functional Reconstruction, China International Neuroscience Institute, Beijing, China.

Publication information

JMIR Form Res. 2024 Aug 5;8:e55577. doi: 10.2196/55577.

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11333861/

Abstract

Cervical spondylosis is the most common degenerative spinal disorder in modern societies. Patients require a great deal of medical knowledge, and large language models (LLMs) offer patients a novel and convenient tool for accessing medical advice. In this study, we collected the most frequently asked questions by patients with cervical spondylosis in clinical work and internet consultations. The accuracy of the answers provided by LLMs was evaluated and graded by 3 experienced spinal surgeons. Comparative analysis of responses showed that all LLMs could provide satisfactory results, and that among them, GPT-4 had the highest accuracy rate. Variation across each section in all LLMs revealed their ability boundaries and the development direction of artificial intelligence.
