Enhancing responses from large language models with role-playing prompts: a comparative study on answering frequently asked questions about total knee arthroplasty.

Authors

Chen Yi-Chen, Lee Sheng-Hsun, Sheu Huan, Lin Sheng-Hsuan, Hu Chih-Chien, Fu Shih-Chen, Yang Cheng-Pang, Lin Yu-Chih

Affiliations

Department of Orthopaedic Surgery, Chang Gung Memorial Hospital, Taoyuan, 333, Taiwan.

Institute of Statistics, National Yang Ming Chiao Tung University, Hsinchu, Taiwan.

Publication information

BMC Med Inform Decis Mak. 2025 May 23;25(1):196. doi: 10.1186/s12911-025-03024-5.

DOI:10.1186/s12911-025-03024-5

PMID:40410756

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12102839/

Abstract

BACKGROUND

The application of artificial intelligence (AI) in medical education and patient interaction is rapidly growing. Large language models (LLMs) such as GPT-3.5, GPT-4, Google Gemini, and Claude 3 Opus have shown potential in providing relevant medical information. This study aims to evaluate and compare the performance of these LLMs in answering frequently asked questions (FAQs) about Total Knee Arthroplasty (TKA), with a specific focus on the impact of role-playing prompts.

METHODS

Four leading LLMs (GPT-3.5, GPT-4, Google Gemini, and Claude 3 Opus) were evaluated using ten standardized patient inquiries related to TKA. Each model produced two distinct responses per question: one generated under zero-shot prompting (question only), and one under role-playing prompting (the model was instructed to simulate an experienced orthopaedic surgeon). Four orthopaedic surgeons rated each response for accuracy and comprehensiveness on a 5-point Likert scale, and gave a binary judgment of acceptability. Statistical analyses (Wilcoxon rank-sum and chi-squared tests; significance threshold P < 0.05) were conducted to compare model performance.
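The two prompting conditions described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' protocol: the abstract does not give the exact persona wording or the ten questions, so `ROLE_INSTRUCTION`, `build_prompts`, and the sample question are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical persona wording. The abstract only states that the model was
# "instructed to simulate an experienced orthopaedic surgeon".
ROLE_INSTRUCTION = (
    "You are an experienced orthopaedic surgeon. "
    "Answer the patient's question."
)


@dataclass(frozen=True)
class Prompt:
    condition: str  # "zero-shot" or "role-playing"
    text: str       # full text sent to the model


def build_prompts(question: str) -> list[Prompt]:
    """Return the two prompt variants for a single patient question."""
    question = question.strip()
    return [
        # Zero-shot: the question alone, with no added instruction.
        Prompt("zero-shot", question),
        # Role-playing: the persona instruction placed before the question.
        Prompt("role-playing", f"{ROLE_INSTRUCTION}\n\n{question}"),
    ]


if __name__ == "__main__":
    # Placeholder question, not one of the study's ten FAQs.
    sample = "How long does recovery take after total knee arthroplasty?"
    for prompt in build_prompts(sample):
        print(f"[{prompt.condition}]\n{prompt.text}\n")
```

Under this design, ten questions yield twenty prompts per model, and each response is then rated independently by the four surgeons.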

RESULTS

ChatGPT-4 with role-playing prompts achieved the highest scores for accuracy (3.73), comprehensiveness (4.05), and acceptability (77.5%), followed closely by ChatGPT-3.5 with role-playing prompts (3.70, 3.85, and 72.5%, respectively). Google Gemini and Claude 3 Opus scored lower on all metrics. In between-model comparisons under zero-shot prompting, ChatGPT-4 achieved significantly higher accuracy and comprehensiveness scores than Google Gemini (P = 0.031 and P = 0.009, respectively) and Claude 3 Opus (P = 0.019 and P = 0.002, respectively), and higher acceptability than Claude 3 Opus (P = 0.006). In within-model comparisons, role-playing significantly improved all metrics for ChatGPT-3.5 (P < 0.05) and acceptability for ChatGPT-4 (P = 0.033). No significant prompting effects were observed for Gemini or Claude.
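The two tests named in the Methods can be sketched with the Python standard library alone. This is an illustrative implementation under stated assumptions, not the authors' analysis code: the rank-sum test uses the normal approximation with a tie correction and no continuity correction, and the chi-squared test is the uncorrected Pearson statistic for a 2x2 table. The Likert scores below are made up. The acceptability counts (31/40 and 29/40) are back-calculated from the reported 77.5% and 72.5%, assuming 40 ratings per condition (4 surgeons x 10 questions); the abstract does not report a test for that particular pair, so the resulting P value is not a study result.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, tie-corrected).

    Returns (W, p), where W is the sum of the ranks of sample x.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted(list(x) + list(y))
    ranks, tie_term, i = {}, 0, 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # midrank of positions i+1..j
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    w = sum(ranks[v] for v in x)
    mean = n1 * (n + 1) / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:  # every observation is identical
        return w, 1.0
    z = (w - mean) / math.sqrt(var)
    return w, math.erfc(abs(z) / math.sqrt(2))


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (1 df, uncorrected) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:  # an empty row or column: no evidence of a difference
        return 0.0, 1.0
    stat = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    return stat, math.erfc(math.sqrt(stat / 2))


if __name__ == "__main__":
    # Hypothetical 5-point Likert ratings for two conditions.
    zero_shot = [3, 3, 4, 2, 3, 4, 3, 2]
    role_play = [4, 4, 5, 3, 4, 4, 3, 4]
    print("rank-sum:", rank_sum_test(role_play, zero_shot))
    # Acceptable vs. not acceptable, assuming 40 ratings per condition.
    print("chi-squared:", chi2_2x2(31, 9, 29, 11))
```

For small samples with many ties, which is typical of Likert data, an exact or permutation-based test may be preferable to the normal approximation used here.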

CONCLUSIONS

This study demonstrates that role-playing prompts can significantly enhance the performance of some LLMs, notably ChatGPT-3.5 and ChatGPT-4, in answering FAQs related to TKA. ChatGPT-4 with role-playing prompts showed the best performance in terms of accuracy, comprehensiveness, and acceptability. Despite occasional inaccuracies, LLMs hold promise for improving patient education and clinical decision-making in orthopaedic practice.

CLINICAL TRIAL NUMBER

Not applicable.
