Department of Neurology, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany.
Digital Health Center, Hasso Plattner Institute, University of Potsdam, Potsdam, Germany.
J Med Internet Res. 2024 Jun 14;26:e53297. doi: 10.2196/53297.
Large language models (LLMs) have demonstrated impressive performance across various medical domains, prompting an exploration of their potential utility within the high-demand setting of emergency department (ED) triage. This study evaluated the triage proficiency of different LLMs and of ChatGPT, an LLM-based chatbot, compared with professionally trained ED staff and untrained personnel. We further explored whether LLM responses could guide untrained staff in effective triage.
This study aimed to assess the efficacy of LLMs and the associated product ChatGPT in ED triage compared with personnel of varying training status and to investigate whether the models' responses can enhance the triage proficiency of untrained personnel.
A total of 124 anonymized case vignettes were triaged by untrained doctors; different versions of currently available LLMs; ChatGPT; and professionally trained raters, who subsequently agreed on a consensus set according to the Manchester Triage System (MTS). The prototypical vignettes were adapted from cases at a tertiary ED in Germany. The main outcome was the level of agreement between raters' MTS level assignments, measured via quadratic-weighted Cohen κ. The extent of over- and undertriage was also determined. Notably, instances of ChatGPT were prompted using zero-shot approaches without extensive background information on the MTS. The tested LLMs included raw GPT-4, Llama 3 70B, Gemini 1.5, and Mixtral 8x7B.
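For readers unfamiliar with the agreement metric, the following minimal sketch (not the authors' code; the labels shown are hypothetical) illustrates how quadratic-weighted Cohen κ and over-/undertriage counts can be computed for MTS level assignments with scikit-learn, assuming MTS levels are encoded as ordinal integers from 1 (immediate) to 5 (nonurgent).

    # Illustrative sketch only: quadratic-weighted Cohen kappa and over-/undertriage
    # counts for MTS level assignments (1 = immediate ... 5 = nonurgent).
    # The vignette labels below are hypothetical, not data from the study.
    from sklearn.metrics import cohen_kappa_score

    consensus = [1, 2, 3, 3, 4, 5, 2, 3]   # professional raters' consensus MTS levels
    rater     = [1, 2, 2, 3, 4, 4, 3, 3]   # e.g., one LLM's or one untrained doctor's assignments

    kappa = cohen_kappa_score(consensus, rater, weights="quadratic")

    # Overtriage: assigning a more urgent (numerically lower) level than the consensus;
    # undertriage: assigning a less urgent (numerically higher) level.
    overtriage  = sum(r < c for r, c in zip(rater, consensus))
    undertriage = sum(r > c for r, c in zip(rater, consensus))

    print(f"quadratic-weighted kappa: {kappa:.2f}, overtriaged: {overtriage}, undertriaged: {undertriage}")

The quadratic weighting penalizes disagreements more heavily the further apart the two assigned MTS levels are, which is why it is preferred over unweighted κ for ordinal triage categories.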
GPT-4-based ChatGPT and untrained doctors showed substantial agreement with the consensus triage of professional raters (κ=mean 0.67, SD 0.037 and κ=mean 0.68, SD 0.056, respectively), significantly exceeding the performance of GPT-3.5-based ChatGPT (κ=mean 0.54, SD 0.024; P<.001). When untrained doctors used this LLM for second-opinion triage, there was a slight but statistically nonsignificant performance increase (κ=mean 0.70, SD 0.047; P=.97). The other tested LLMs performed similarly to or worse than GPT-4-based ChatGPT or showed odd triaging behavior with the parameters used. LLMs and ChatGPT models tended toward overtriage, whereas untrained doctors tended to undertriage.
While LLMs and the LLM-based product ChatGPT do not yet match professionally trained raters, the best models' triage proficiency equals that of untrained ED doctors. In their current form, LLMs and ChatGPT thus did not demonstrate gold-standard performance in ED triage and, in the setting of this study, failed to significantly improve untrained doctors' triage when used as decision support. Notable performance enhancements in newer LLM versions over older ones hint at future improvements with further technological development and specific training.