

Accuracy of a Commercial Large Language Model (ChatGPT) to Perform Disaster Triage of Simulated Patients Using the Simple Triage and Rapid Treatment (START) Protocol: Gage Repeatability and Reproducibility Study.

Affiliations

Department of Emergency Medicine, University of Alberta, Edmonton, AB, Canada.

CRIMEDIM-Center for Research and Training in Disaster Medicine, Humanitarian Aid and Global Health, Universita' del Piemonte Orientale, Novara, Italy.

Publication Information

J Med Internet Res. 2024 Sep 30;26:e55648. doi: 10.2196/55648.

Abstract

BACKGROUND

The release of ChatGPT (OpenAI) in November 2022 drastically reduced the barrier to using artificial intelligence by allowing a simple web-based text interface to a large language model (LLM). One use case where ChatGPT could be useful is in triaging patients at the site of a disaster using the Simple Triage and Rapid Treatment (START) protocol. However, LLMs experience several common errors including hallucinations (also called confabulations) and prompt dependency.

OBJECTIVE

This study addresses the research question "Can ChatGPT adequately triage simulated disaster patients using the START protocol?" by measuring three outcomes: repeatability, reproducibility, and accuracy.

METHODS

Nine prompts were developed by five disaster medicine physicians. A Python script queried ChatGPT Version 4 for each prompt combined with 391 validated simulated patient vignettes. Ten repetitions of each combination were performed for a total of 35,190 simulated triages. A reference standard START triage code for each simulated case was assigned by two disaster medicine specialists (JMF and MV), with a third specialist (LC) added if the first two did not agree. Results were evaluated using a gage repeatability and reproducibility study (gage R and R). Repeatability was defined as variation due to repeated use of the same prompt. Reproducibility was defined as variation due to the use of different prompts on the same patient vignette. Accuracy was defined as agreement with the reference standard.
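The experimental design above (9 prompts × 391 vignettes × 10 repetitions) can be sketched as a query loop. This is a minimal illustration, not the authors' script: the prompt texts, vignette contents, and `query_llm` stub are all placeholders for the real API call to ChatGPT Version 4.

```python
import itertools

# Hypothetical stand-ins for the study's materials (illustrative only).
PROMPTS = [f"prompt_{i}" for i in range(1, 10)]       # 9 prompt variants
VIGNETTES = [f"vignette_{j}" for j in range(1, 392)]  # 391 simulated patients
REPETITIONS = 10

def query_llm(prompt: str, vignette: str, rep: int) -> str:
    """Placeholder for the ChatGPT API call; would return a START color code."""
    return "green"  # stub response

results = []
for prompt, vignette in itertools.product(PROMPTS, VIGNETTES):
    for rep in range(REPETITIONS):
        results.append((prompt, vignette, rep, query_llm(prompt, vignette, rep)))

print(len(results))  # 9 * 391 * 10 = 35190 simulated triages
```

Each (prompt, vignette) cell yields 10 repeated trials, which is what makes the gage R and R decomposition into repeatability and reproducibility possible.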

RESULTS

Although 35,102 (99.7%) queries returned a valid START score, there was considerable variability. Repeatability (use of the same prompt repeatedly) was 14% of the overall variation. Reproducibility (use of different prompts) was 4.1% of the overall variation. The accuracy of ChatGPT for START was 63.9% with a 32.9% overtriage rate and a 3.1% undertriage rate. Accuracy varied by prompt with a maximum of 71.8% and a minimum of 46.7%.
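The accuracy, overtriage, and undertriage rates reported above can be computed by comparing each predicted START code against the reference standard on an urgency ordering. This sketch uses a simplified ordering that places "black" (expectant) above "red"; that ordering, and the toy data, are assumptions for illustration, not the paper's exact scoring rules.

```python
# START categories mapped to an urgency rank (simplified assumption).
ORDER = {"green": 0, "yellow": 1, "red": 2, "black": 3}

def triage_metrics(pred, ref):
    """Accuracy plus over-/undertriage rates for paired START codes."""
    n = len(pred)
    correct = sum(p == r for p, r in zip(pred, ref))
    over = sum(ORDER[p] > ORDER[r] for p, r in zip(pred, ref))   # more urgent than reference
    under = sum(ORDER[p] < ORDER[r] for p, r in zip(pred, ref))  # less urgent than reference
    return correct / n, over / n, under / n

# Toy example (not the study's data):
pred = ["red", "red", "green", "yellow", "red"]
ref  = ["red", "yellow", "green", "yellow", "black"]
acc, over, under = triage_metrics(pred, ref)  # 0.6, 0.2, 0.2
```

By construction the three rates sum to 1, matching the abstract's 63.9% + 32.9% + 3.1% ≈ 100% split.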

CONCLUSIONS

This study indicates that ChatGPT version 4 is insufficient to triage simulated disaster patients via the START protocol. It demonstrated suboptimal repeatability and reproducibility. The overall accuracy of triage was only 63.9%. Health care professionals are advised to exercise caution while using commercial LLMs for vital medical determinations, given that these tools may commonly produce inaccurate data, colloquially referred to as hallucinations or confabulations. Artificial intelligence-guided tools should undergo rigorous statistical evaluation, using methods such as gage R and R, before implementation into clinical settings.
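A gage R and R evaluation of the kind recommended above decomposes measurement variation into repeatability (same prompt, repeated trials) and reproducibility (different prompts). The sketch below is a deliberately simplified decomposition on toy data, treating prompts as "operators" and vignettes as "parts"; a full gage R and R study uses ANOVA mean squares rather than raw variances.

```python
from statistics import mean, pvariance

# Toy scores (e.g. START codes mapped to 0-3); illustrative, not study data.
# Keys are (prompt, vignette) cells; values are repeated-trial scores.
data = {
    ("p1", "v1"): [2, 2, 3], ("p1", "v2"): [0, 0, 0],
    ("p2", "v1"): [2, 3, 3], ("p2", "v2"): [0, 1, 0],
}

# Repeatability: average within-cell variance (same prompt, same vignette).
repeatability = mean(pvariance(trials) for trials in data.values())

# Reproducibility: variance of prompt-level means (simplified estimate).
prompts = {p for p, _ in data}
prompt_means = [mean(mean(tr) for (p, _), tr in data.items() if p == pp)
                for pp in prompts]
reproducibility = pvariance(prompt_means)
```

Large repeatability relative to total variation (14% in this study) means the same prompt gives inconsistent answers on repeated queries, which is exactly the instability the abstract warns about.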


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ec10/11474136/fd814b660e52/jmir_v26i1e55648_fig1.jpg
