Li Hao, Zeng Cheng, Miao Lei, Wang Ye, Xia Jiyuan, He Da
Department of Orthopedics, Beijing Jishuitan Hospital, Capital Medical University, Beijing, PR China.
School of Computer Science, South China Business College of Guangdong University of Foreign Studies, Guangzhou, Guangdong, PR China.
Ann Med Surg (Lond). 2025 Jun 30;87(8):4835-4840. doi: 10.1097/MS9.0000000000003519. eCollection 2025 Aug.
This study aimed to evaluate and compare the performance of three large language models (LLMs) in providing information on endoscopic lumbar surgery: ChatGPT o1-preview, Claude 3.5 Sonnet, and Gemini 1.5 Pro, each tested on 10 frequently asked patient questions.
Ten frequently asked patient questions about endoscopic lumbar surgery were selected through discussion among the authors and submitted to the three LLMs. Responses were evaluated by five spine surgeons using a 5-point Likert scale for overall quality, text readability, content relevance, and humanistic care. Additionally, five non-medical volunteers assessed the understandability of, and their satisfaction with, the responses.
The intraclass correlation coefficients (ICCs) among the five evaluators were 0.522, 0.686, and 0.512 for ChatGPT o1-preview, Claude 3.5 Sonnet, and Gemini 1.5 Pro, respectively. Claude 3.5 Sonnet received the highest scores for overall quality (4.86 ± 0.35, P < 0.001), text readability (4.91 ± 0.32, P < 0.001), and content relevance (4.78 ± 0.42, P < 0.001). ChatGPT o1-preview was the most approved by non-medical volunteers (49%), followed by Gemini 1.5 Pro (29%) and Claude 3.5 Sonnet (22%).
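The ICCs above quantify how consistently the five surgeons rated each model's answers. The abstract does not state which ICC form was used; a minimal sketch, assuming an ICC(2,1) model (two-way random effects, absolute agreement, single rater) computed from a subjects-by-raters matrix of Likert scores, could look like this:

```python
import numpy as np

def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: array of shape (n_subjects, k_raters), e.g. 10 question
    responses scored by 5 raters. This form is an assumption; the study
    may have used a different ICC variant.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)   # per-response means
    col_means = x.mean(axis=0)   # per-rater means
    # Two-way ANOVA decomposition of total sum of squares
    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_total = ((x - grand) ** 2).sum()
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)              # between-subjects mean square
    msc = ss_cols / (k - 1)              # between-raters mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Values around 0.5 to 0.7, as reported, are conventionally read as moderate inter-rater reliability; perfect agreement across raters yields an ICC of 1.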
From the perspective of professional surgeons, Claude 3.5 Sonnet provided the highest quality and most relevant information. However, ChatGPT o1-preview was more understandable and satisfactory for non-professional users. This study not only highlights the potential of LLMs in patient education but also emphasizes the need for careful consideration of their role in medical practice, including technical limitations and ethical issues.