Liu Yukang, Li Hua, Ouyang Jianfeng, Xue Zhaowen, Wang Min, He Hebei, Song Bin, Zheng Xiaofei, Gan Wenyi
The Second School of Clinical Medicine, Southern Medical University, Guangzhou, China.
Department of Orthopedics, Beijing Jishuitan Hospital, Beijing, China.
JMIR Perioper Med. 2025 Jun 12;8:e70047. doi: 10.2196/70047.
Large language models (LLMs) are revolutionizing natural language processing and are increasingly applied in clinical settings to enhance preoperative patient education.
This study aimed to evaluate the effectiveness and applicability of various LLMs in preoperative patient education by analyzing their responses to superior capsular reconstruction (SCR)-related inquiries.
In total, 10 sports medicine clinical experts formulated 11 SCR-related questions and developed preoperative patient education strategies during a webinar, then input 12 text commands into Claude-3-Opus (Anthropic), GPT-4-Turbo (OpenAI), and Gemini-1.5-Pro (Google DeepMind). A total of 3 experts assessed the language models' responses for correctness, completeness, logic, potential harm, and overall satisfaction. The preoperative education documents were evaluated with the DISCERN questionnaire and the Patient Education Materials Assessment Tool, and were reviewed by 5 postoperative patients for readability and educational value. The readability of all responses was also analyzed with the cntext package and py-readability-metrics.
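As a minimal illustration of the kind of English readability scoring that packages such as py-readability-metrics automate, the sketch below computes the Flesch Reading Ease score from scratch. The formula is standard; the syllable counter is a rough vowel-group heuristic introduced here for self-containment, not the package's own implementation, and Chinese-text readability (the cntext side of the analysis) is not covered.

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, dropping one for a silent trailing 'e'."""
    word = word.lower()
    groups = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and groups > 1:
        groups -= 1
    return max(1, groups)

def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).

    Higher scores indicate easier text; very simple prose can exceed 100.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

# Example: 12 monosyllabic words across 2 sentences score as very easy text.
easy = "The cat sat on the mat. The dog ran to the park."
print(flesch_reading_ease(easy))  # → 116.145
```

In practice py-readability-metrics wraps several such formulas (Flesch, Flesch-Kincaid, Gunning Fog, etc.) behind one `Readability` class with proper tokenization, which is why the study could score every model response uniformly.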
Between July 1 and August 17, 2024, sports medicine experts and patients evaluated 33 responses and 3 preoperative patient education documents generated by 3 language models regarding SCR surgery. For the 11 query responses, clinicians rated Gemini significantly higher than Claude in all categories (P<.05) and higher than GPT in completeness, risk avoidance, and overall rating (P<.05). For the 3 educational documents, Gemini's Patient Education Materials Assessment Tool score significantly exceeded Claude's (P=.03), and patients rated Gemini's materials superior in all aspects, with significant differences in educational quality versus Claude (P=.02) and overall satisfaction versus both Claude (P<.01) and GPT (P=.01). GPT had significantly higher readability than Claude on 3 R-based metrics (P<.01). Interrater agreement was high among clinicians and fair among patients.
Claude-3-Opus, GPT-4-Turbo, and Gemini-1.5-Pro effectively generated readable presurgical education materials but lacked citations and failed to discuss alternative treatments or the risks of forgoing SCR surgery, highlighting the need for expert oversight when using these LLMs in patient education.