Evaluation of performance of generative large language models for stroke care.

Authors

Lee John Tayu, Li Vincent Cheng-Sheng, Wu Jia-Jyun, Chen Hsiao-Hui, Su Sophia Sin-Yu, Chang Brian Pin-Hsuan, Lai Richard Lee, Liu Chi-Hung, Chen Chung-Ting, Tanapima Valis, Shen Toby Kai-Bo, Atun Rifat

Affiliations

Department of Global Health and Population, Harvard T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts, USA.

Institute of Health Policy and Management, College of Public Health, National Taiwan University, Taipei, Taiwan.

Publication information

NPJ Digit Med. 2025 Jul 29;8(1):481. doi: 10.1038/s41746-025-01830-9.

DOI:10.1038/s41746-025-01830-9

PMID:40730644

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12307639/

Abstract

Stroke is a leading cause of global morbidity and mortality, disproportionately impacting lower socioeconomic groups. In this study, we evaluated three generative large language models (LLMs), namely GPT, Claude, and Gemini, across four stages of stroke care: prevention, diagnosis, treatment, and rehabilitation. We applied each of three prompt engineering techniques, Zero-Shot Learning (ZSL), Chain of Thought (COT), and Talking Out Your Thoughts (TOT), to realistic stroke scenarios. Clinical experts assessed the outputs across five domains, based on clinical competency benchmarks: (1) accuracy; (2) hallucinations; (3) specificity; (4) empathy; and (5) actionability. Overall, the LLMs demonstrated suboptimal performance, with inconsistent scores across domains. Each prompt engineering method showed strengths in specific areas: TOT performed well in empathy and actionability, COT was strong in structured reasoning during diagnosis, and ZSL provided concise, accurate responses with fewer hallucinations, especially in the treatment stage. However, none consistently met high clinical standards across all stroke care stages.
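The study design described in the abstract can be pictured as a grid of models, prompting techniques, and care stages, with each output rated on five domains. The following is a minimal sketch of that layout only, not the authors' code. The function names `evaluation_grid` and `mean_by_domain` are hypothetical, the sketch assumes one condition per care stage (the abstract does not state how many scenarios were used), and the numeric rating scale is a placeholder because the abstract does not specify one.

```python
from itertools import product
from statistics import mean

# Factors named in the abstract.
MODELS = ["GPT", "Claude", "Gemini"]
PROMPTS = ["ZSL", "COT", "TOT"]
STAGES = ["prevention", "diagnosis", "treatment", "rehabilitation"]
DOMAINS = ["accuracy", "hallucinations", "specificity", "empathy", "actionability"]


def evaluation_grid():
    """Return every (model, prompt technique, care stage) combination.

    Assumption: one condition per care stage; the real number of
    scenarios per stage is not given in the abstract.
    """
    return list(product(MODELS, PROMPTS, STAGES))


def mean_by_domain(ratings):
    """Average expert ratings per domain for one model output.

    `ratings` is a list with one dict per expert, mapping each domain
    name to a numeric score. The scale is hypothetical.
    """
    return {domain: mean(r[domain] for r in ratings) for domain in DOMAINS}


if __name__ == "__main__":
    grid = evaluation_grid()
    # 3 models x 3 prompt techniques x 4 stages
    print(len(grid))
```

Keeping the factors as plain lists makes it easy to see how quickly the number of outputs to be rated grows: each additional model or prompting technique multiplies the number of conditions that clinical experts would need to review.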
