Application of Large Language Models in Medical Training Evaluation-Using ChatGPT as a Standardized Patient: Multimetric Assessment.

Author information

Wang Chenxu, Li Shuhan, Lin Nuoxi, Zhang Xinyu, Han Ying, Wang Xiandi, Liu Di, Tan Xiaomei, Pu Dan, Li Kang, Qian Guangwu, Yin Rong

Affiliations

West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China.

Department of Industrial Engineering, Pittsburgh Institute, Sichuan University, Chengdu, China.

Publication information

J Med Internet Res. 2025 Jan 1;27:e59435. doi: 10.2196/59435.

Abstract

BACKGROUND

With increasing interest in applying large language models (LLMs) in medicine, their potential use as standardized patients in medical assessment has rarely been evaluated. We therefore explored whether ChatGPT, a representative LLM, could transform medical education by serving as a cost-effective alternative to standardized patients, particularly for history-taking tasks.

OBJECTIVE

This study aims to explore ChatGPT's viability and performance as a standardized patient, using prompt engineering to refine its accuracy and utility in medical assessments.

METHODS

A 2-phase experiment was conducted. The first phase assessed feasibility by simulating conversations about inflammatory bowel disease (IBD) across 3 inquiry-quality groups (poor, medium, and good). Each group consisted of 30 runs, and responses were scored for their relevance and accuracy with respect to the inquiries. In the second phase, we evaluated ChatGPT's performance against specific criteria, focusing on its anthropomorphism, clinical accuracy, and adaptability. Prompts were adjusted to address shortcomings in ChatGPT's responses, and performance under the original and revised prompts was compared across a total of 300 runs against standard reference scores. Finally, the generalizability of the revised prompt was tested on other scripts over another 60 runs, and the impact of the language used on chatbot performance was explored.
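
As a concrete illustration of the phase 1 setup, the sketch below shows one plausible way to drive an LLM as a standardized patient through a role-constraining system prompt. The OpenAI chat completions wrapper, model name, and IBD persona text are assumptions for illustration, not the authors' actual prompt or configuration.

```python
# A minimal sketch of running an LLM as a standardized patient, assuming the
# OpenAI chat completions API. The persona, model, and script details are
# illustrative placeholders, not the study's actual prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical patient script: role constraints plus case facts the "patient"
# may reveal only when the student asks the right history-taking questions.
PATIENT_PROMPT = """You are a standardized patient in a history-taking exam.
Stay in character as a 28-year-old with several months of abdominal pain,
bloody diarrhea, and weight loss (an inflammatory bowel disease script).
Answer only what the student explicitly asks; volunteer nothing else.
Do not reveal a diagnosis, and never break character."""

def patient_reply(history: list[dict], student_question: str) -> str:
    """Append the student's question and return the simulated patient's answer."""
    history.append({"role": "user", "content": student_question})
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder; the paper's exact model/version may differ
        messages=[{"role": "system", "content": PATIENT_PROMPT}] + history,
        temperature=0.7,
    )
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer

history: list[dict] = []
print(patient_reply(history, "What brings you in today?"))
print(patient_reply(history, "How long have you had these symptoms?"))
```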

RESULTS

The feasibility test confirmed ChatGPT's ability to simulate a standardized patient effectively, differentiating among poor, medium, and good medical inquiries with varying degrees of accuracy. Score differences between the poor (mean 74.7, SD 5.44) and medium (mean 82.67, SD 5.30) inquiry groups (P<.001) and between the poor and good (mean 85, SD 3.27) inquiry groups (P<.001) were significant at a significance level (α) of .05, whereas the difference between the medium and good inquiry groups was not statistically significant (P=.16). The revised prompt significantly improved ChatGPT's realism, clinical accuracy, and adaptability, markedly reducing scoring discrepancies: scoring accuracy improved 4.926-fold over the unrevised prompt, with the score difference percentage dropping from 29.83% to 6.06% and the SD dropping from 0.55 to 0.068. The chatbot's performance on a separate script was acceptable, with an average score difference percentage of 3.21%. Moreover, performance differences between test groups using various language combinations were nonsignificant.
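
To make the reported figures concrete, the sketch below checks that the stated 4.926-fold accuracy improvement is consistent with the drop in score difference percentage (29.83/6.06 ≈ 4.92) and illustrates the kind of two-group comparison behind the reported P values. The abstract does not name the statistical test used, so Welch's t test and the synthetic samples below are assumptions, not the study's data or analysis.

```python
# Back-of-the-envelope check on the reported figures, plus an illustrative
# two-group comparison. Welch's t test is one plausible choice of test,
# shown purely as an assumption.
import numpy as np
from scipy import stats

# Reported score difference percentages before and after prompt revision.
before, after = 29.83, 6.06
print(f"improvement factor: {before / after:.3f}x")  # ~4.92, consistent with the ~4.926x claim

# Synthetic samples drawn from the reported means/SDs (n=30 per group);
# these are NOT the study's data, only an illustration of the comparison.
rng = np.random.default_rng(0)
poor = rng.normal(74.7, 5.44, 30)
good = rng.normal(85.0, 3.27, 30)
t, p = stats.ttest_ind(poor, good, equal_var=False)  # Welch's t test
print(f"t = {t:.2f}, P = {p:.2e}")
```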

CONCLUSIONS

ChatGPT, as a representative LLM, is a viable tool for simulating standardized patients in medical assessments, with the potential to enhance medical training. With properly engineered prompts, ChatGPT's scoring accuracy and response realism improved significantly, approaching feasibility for actual clinical use. The language used also had no significant influence on the chatbot's performance.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/44f0/11736217/2766508861e9/jmir_v27i1e59435_fig1.jpg
