Evaluating the use of large language models to provide clinical recommendations in the Emergency Department.

Author information

Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA.

Department of Emergency Medicine, University of California, San Francisco, CA, USA.

Publication information

Nat Commun. 2024 Oct 8;15(1):8236. doi: 10.1038/s41467-024-52415-1.

The release of GPT-4 and other large language models (LLMs) has the potential to transform healthcare. However, existing research evaluating LLM performance on real-world clinical notes is limited. Here, we conduct a highly powered study to determine whether LLMs can provide clinical recommendations for three tasks (admission status, radiological investigation(s) request status, and antibiotic prescription status) using clinical notes from the Emergency Department. We randomly selected 10,000 Emergency Department visits to evaluate the accuracy of zero-shot, GPT-3.5-turbo- and GPT-4-turbo-generated clinical recommendations across four different prompting strategies. We found that both GPT-4-turbo and GPT-3.5-turbo performed poorly compared to a resident physician, with accuracy scores that were on average 8% and 24% lower, respectively, than the physician's. Both LLMs tended to be overly cautious in their recommendations, with high sensitivity at the cost of specificity. Our findings demonstrate that, while early evaluations of the clinical use of LLMs are promising, LLM performance must be significantly improved before their deployment as decision support systems for clinical recommendations and other complex tasks.
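The trade-off between sensitivity and specificity described in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `binary_metrics`, the `BinaryMetrics` container, and the toy labels are all hypothetical, chosen only to show how an overly cautious recommender (one that says "yes" too often) scores high on sensitivity and low on specificity.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class BinaryMetrics:
    accuracy: float
    sensitivity: float
    specificity: float


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> BinaryMetrics:
    """Compute accuracy, sensitivity, and specificity for 0/1 labels.

    Hypothetical helper for illustration; 1 means the action was taken
    (e.g. patient admitted), 0 means it was not.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    total = tp + tn + fp + fn
    nan = float("nan")
    return BinaryMetrics(
        accuracy=(tp + tn) / total if total else nan,
        sensitivity=tp / (tp + fn) if (tp + fn) else nan,
        specificity=tn / (tn + fp) if (tn + fp) else nan,
    )


# Toy labels (invented, not study data): two true positives, four true negatives.
truth = [1, 1, 0, 0, 0, 0]
# An overly cautious recommender flags almost every visit as positive.
overcautious = [1, 1, 1, 1, 1, 0]

m = binary_metrics(truth, overcautious)
print(m)
```

In this toy case every true positive is caught, so sensitivity is perfect, but most negatives are flagged as well, which drags down both specificity and overall accuracy.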
