Li Hao, Zeng Cheng, Miao Lei, Wang Ye, Xia Jiyuan, He Da
Department of Orthopedics, Beijing Jishuitan Hospital, Capital Medical University, Beijing, PR China.
School of Computer Science, South China Business College of Guangdong University of Foreign Studies, Guangzhou, Guangdong, PR China.
Ann Med Surg (Lond). 2025 Jun 30;87(8):4835-4840. doi: 10.1097/MS9.0000000000003519. eCollection 2025 Aug.
This study aimed to evaluate and compare the performance of three large language models (LLMs) in providing information on endoscopic lumbar surgery: ChatGPT o1-preview, Claude 3.5 Sonnet, and Gemini 1.5 Pro, each tested on 10 frequently asked patient questions.
Ten frequently asked patient questions about endoscopic lumbar surgery were selected through discussion among the authors and submitted to the three LLMs. Responses were evaluated by five spine surgeons using a 5-point Likert scale for overall quality, text readability, content relevance, and humanistic care. Additionally, five non-medical volunteers assessed the understandability of, and their satisfaction with, the responses.
The intraclass correlation coefficients (ICCs) among the five evaluators were 0.522, 0.686, and 0.512 for ChatGPT o1-preview, Claude 3.5 Sonnet, and Gemini 1.5 Pro, respectively. Claude 3.5 Sonnet received the highest scores for overall quality (4.86 ± 0.35, P < 0.001), text readability (4.91 ± 0.32, P < 0.001), and content relevance (4.78 ± 0.42, P < 0.001). ChatGPT o1-preview was the most approved by non-medical volunteers (49%), followed by Gemini 1.5 Pro (29%) and Claude 3.5 Sonnet (22%).
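The ICCs above quantify how consistently the five surgeons rated each model's answers. The abstract does not state which ICC form was used; a minimal sketch, assuming an ICC(2,1) model (two-way random effects, absolute agreement, single rater) computed from a subjects-by-raters matrix of Likert scores, could look like this:

```python
import numpy as np

def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: array of shape (n_subjects, k_raters), e.g. 10 question
    responses scored by 5 raters. This form is an assumption; the study
    may have used a different ICC variant.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)   # per-response means
    col_means = x.mean(axis=0)   # per-rater means
    # Two-way ANOVA decomposition of total sum of squares
    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_total = ((x - grand) ** 2).sum()
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)              # between-subjects mean square
    msc = ss_cols / (k - 1)              # between-raters mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Values around 0.5 to 0.7, as reported, are conventionally read as moderate inter-rater reliability; perfect agreement across raters yields an ICC of 1.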
From the perspective of professional surgeons, Claude 3.5 Sonnet provided the highest quality and most relevant information. However, ChatGPT o1-preview was more understandable and satisfactory for non-professional users. This study not only highlights the potential of LLMs in patient education but also emphasizes the need for careful consideration of their role in medical practice, including technical limitations and ethical issues.