Enhancing responses from large language models with role-playing prompts: a comparative study on answering frequently asked questions about total knee arthroplasty.

Author information

Chen Yi-Chen, Lee Sheng-Hsun, Sheu Huan, Lin Sheng-Hsuan, Hu Chih-Chien, Fu Shih-Chen, Yang Cheng-Pang, Lin Yu-Chih

Affiliations

Department of Orthopaedic Surgery, Chang Gung Memorial Hospital, Taoyuan, 333, Taiwan.

Institute of Statistics, National Yang Ming Chiao Tung University, Hsinchu, Taiwan.

Publication information

BMC Med Inform Decis Mak. 2025 May 23;25(1):196. doi: 10.1186/s12911-025-03024-5.

Abstract

BACKGROUND

The application of artificial intelligence (AI) in medical education and patient interaction is rapidly growing. Large language models (LLMs) such as GPT-3.5, GPT-4, Google Gemini, and Claude 3 Opus have shown potential in providing relevant medical information. This study aims to evaluate and compare the performance of these LLMs in answering frequently asked questions (FAQs) about Total Knee Arthroplasty (TKA), with a specific focus on the impact of role-playing prompts.

METHODS

Four leading LLMs (GPT-3.5, GPT-4, Google Gemini, and Claude 3 Opus) were evaluated using ten standardized patient inquiries related to TKA. Each model produced two distinct responses per question: one generated under zero-shot prompting (question only), and one under role-playing prompting (the model was instructed to simulate an experienced orthopaedic surgeon). Four orthopaedic surgeons evaluated the responses for accuracy and comprehensiveness on a 5-point Likert scale, along with a binary measure of acceptability. Statistical analyses (Wilcoxon rank sum and Chi-squared tests, with significance set at P < 0.05) were conducted to compare model performance.
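The two prompting conditions described above can be sketched as follows. This is only an illustration: the abstract does not reproduce the study's prompt wording, so the `ROLE_PREFIX` text, the `Prompt` type, the `build_prompts` helper, and the sample question are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical wording: the abstract does not give the study's actual prompt text.
ROLE_PREFIX = "Act as an experienced orthopaedic surgeon. "


@dataclass(frozen=True)
class Prompt:
    condition: str  # "zero-shot" or "role-playing"
    text: str


def build_prompts(question: str) -> list[Prompt]:
    """Return the two prompt variants for one patient question."""
    q = question.strip()
    return [
        Prompt("zero-shot", q),                   # question only
        Prompt("role-playing", ROLE_PREFIX + q),  # persona instruction + question
    ]


# Hypothetical sample question, not one of the study's ten inquiries.
for p in build_prompts("How long does recovery take after total knee arthroplasty?"):
    print(f"[{p.condition}] {p.text}")
```

Keeping the question text identical across both variants means any difference in the rated responses can be attributed to the persona instruction alone.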

RESULTS

ChatGPT-4 with role-playing prompts achieved the highest scores for accuracy (3.73), comprehensiveness (4.05), and acceptability (77.5%), followed closely by ChatGPT-3.5 with role-playing prompts (3.70, 3.85, and 72.5%, respectively). Google Gemini and Claude 3 Opus demonstrated lower performance across all metrics. In between-model comparisons based on zero-shot prompting, ChatGPT-4 achieved significantly higher scores for both accuracy and comprehensiveness than Google Gemini (P = 0.031 and P = 0.009, respectively) and Claude 3 Opus (P = 0.019 and P = 0.002, respectively), and demonstrated higher acceptability than Claude 3 Opus (P = 0.006). Within-model comparisons showed that role-playing significantly improved all metrics for ChatGPT-3.5 (P < 0.05) and acceptability for ChatGPT-4 (P = 0.033). No significant prompting effects were observed for Gemini or Claude.
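As a rough check on how to read the acceptability percentages, the sketch below converts them back to counts. It assumes that each of the four surgeons rated every response, giving 4 × 10 = 40 ratings per model and prompt condition; the abstract does not state this denominator explicitly, and the `acceptable_count` helper is hypothetical.

```python
RATERS = 4       # orthopaedic surgeons (from the Methods)
QUESTIONS = 10   # standardized patient inquiries (from the Methods)
N = RATERS * QUESTIONS  # assumed ratings per model/prompt condition


def acceptable_count(percent: float, n: int = N) -> int:
    """Number of 'acceptable' ratings implied by a reported percentage."""
    return round(percent * n / 100)


reported = [
    ("ChatGPT-4 + role-playing", 77.5),
    ("ChatGPT-3.5 + role-playing", 72.5),
]
for label, pct in reported:
    k = acceptable_count(pct)
    print(f"{label}: {k}/{N} ratings acceptable ({100 * k / N:.1f}%)")
```

Under that assumption, the reported 77.5% and 72.5% correspond to 31 and 29 acceptable ratings out of 40.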

CONCLUSIONS

This study demonstrates that role-playing prompts significantly enhance the performance of LLMs, particularly for ChatGPT-3.5 and ChatGPT-4, in answering FAQs related to TKA. ChatGPT-4, with role-playing prompts, showed superior performance in terms of accuracy, comprehensiveness, and acceptability. Despite occasional inaccuracies, LLMs hold promise for improving patient education and clinical decision-making in orthopaedic practice.

CLINICAL TRIAL NUMBER

Not applicable.
