
Evaluation of the performance of large language models in endoscopic lumbar surgery: a comparative analysis.

Authors

Li Hao, Zeng Cheng, Miao Lei, Wang Ye, Xia Jiyuan, He Da

Affiliations

Department of Orthopedics, Beijing Jishuitan Hospital, Capital Medical University, Beijing, PR China.

School of Computer Science, South China Business College of Guangdong University of Foreign Studies, Guangzhou, Guangdong, PR China.

Publication information

Ann Med Surg (Lond). 2025 Jun 30;87(8):4835-4840. doi: 10.1097/MS9.0000000000003519. eCollection 2025 Aug.

DOI:10.1097/MS9.0000000000003519
PMID:40787572
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12333696/
Abstract

OBJECTIVE

This study aimed to evaluate and compare the performance of three large language models (LLMs), ChatGPT o1-preview, Claude 3.5 Sonnet, and Gemini 1.5 Pro, in providing information on endoscopic lumbar surgery based on 10 frequently asked patient questions.

METHODS

Ten frequently asked patient questions about endoscopic lumbar surgery were selected through discussion among the authors. These questions were then submitted to the three LLMs. Five spine surgeons rated the responses on a 5-point Likert scale for overall quality, text readability, content relevance, and humanistic care. Additionally, five non-medical volunteers assessed how understandable the responses were and how satisfied they were with them.
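
As an illustration of how 5-point Likert ratings are commonly condensed into the "mean ± SD" form used in the results, here is a minimal sketch. The scores are hypothetical, not the study's data, and the use of the sample standard deviation is an assumption, since the abstract does not say which form was reported.

```python
from statistics import mean, stdev

def summarize(scores):
    """Format Likert scores as 'mean ± SD' (sample SD, an assumption)."""
    return f"{mean(scores):.2f} ± {stdev(scores):.2f}"

# Hypothetical ratings for one model on one criterion (not study data).
hypothetical_scores = [5, 5, 4, 5, 5, 4, 5, 5, 5, 5]

print(summarize(hypothetical_scores))  # prints "4.80 ± 0.42"
```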

RESULTS

The intraclass correlation coefficients among the five evaluators for ChatGPT o1-preview, Claude 3.5 Sonnet, and Gemini 1.5 Pro were 0.522, 0.686, and 0.512, respectively. Claude 3.5 Sonnet received the highest scores for overall quality (4.86 ± 0.35, P < 0.001), text readability (4.91 ± 0.32, P < 0.001), and content relevance (4.78 ± 0.42, P < 0.001). ChatGPT o1-preview received the most approval from volunteers without a medical background (49%), followed by Gemini 1.5 Pro (29%) and Claude 3.5 Sonnet (22%).
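
For readers unfamiliar with the intraclass correlation coefficient, the sketch below computes one common variant, ICC(2,1) (two-way random effects, absolute agreement, single rater), from a subjects-by-raters table. The abstract does not state which ICC form the authors used, so the choice of variant is an assumption, and the ratings shown are a small illustrative table rather than the study's data.

```python
def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one row per rated item and one column
    per rater. The ICC variant is an assumption; the abstract does not
    specify which form was reported.
    """
    n = len(ratings)        # number of rated items
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares (no replication).
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)              # between-item mean square
    msc = ss_cols / (k - 1)              # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Illustrative table: 6 items rated by 4 raters (not study data).
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc2_1(example), 2))  # prints 0.29
```

Values near 1 indicate that raters assign nearly identical scores. By a commonly cited rule of thumb, values between 0.5 and 0.75, the range reported here, indicate moderate agreement.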

CONCLUSION

From the perspective of professional surgeons, Claude 3.5 Sonnet provided the highest quality and most relevant information. However, ChatGPT o1-preview was more understandable and satisfactory for non-professional users. This study not only highlights the potential of LLMs in patient education but also emphasizes the need for careful consideration of their role in medical practice, including technical limitations and ethical issues.

Figures

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6f1b/12333696/05bb0b956722/ms9-87-4835-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6f1b/12333696/392b82626bc6/ms9-87-4835-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6f1b/12333696/d1fc36de08f6/ms9-87-4835-g003.jpg
