Large Language Model Capabilities in Perioperative Risk Prediction and Prognostication.

Author Affiliations

Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, California.

Department of Anesthesiology & Pain Medicine, University of Washington, Seattle.

Publication Information

JAMA Surg. 2024 Aug 1;159(8):928-937. doi: 10.1001/jamasurg.2024.1621.

Abstract

IMPORTANCE

General-domain large language models may be able to perform risk stratification and predict postoperative outcome measures using a description of the procedure and a patient's electronic health record notes.

OBJECTIVE

To examine predictive performance on 8 different tasks: prediction of American Society of Anesthesiologists Physical Status (ASA-PS), hospital admission, intensive care unit (ICU) admission, unplanned admission, hospital mortality, postanesthesia care unit (PACU) phase 1 duration, hospital duration, and ICU duration.

DESIGN, SETTING, AND PARTICIPANTS

This prognostic study included task-specific datasets constructed from 2 years of retrospective electronic health records data collected during routine clinical care. Case and note data were formatted into prompts and given to the large language model GPT-4 Turbo (OpenAI) to generate a prediction and explanation. The setting included a quaternary care center comprising 3 academic hospitals and affiliated clinics in a single metropolitan area. Patients who had a surgery or procedure with anesthesia and at least 1 clinician-written note filed in the electronic health record before surgery were included in the study. Data were analyzed from November to December 2023.

EXPOSURES

Four prompting strategies were compared: original notes, note summaries, few-shot prompting, and chain-of-thought prompting.
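The abstract does not give the prompt templates, so the following is only a minimal sketch of how case and note data might be assembled into a prompt under the strategies named above. The function `build_prompt`, its parameters, the section labels, and the toy inputs are all hypothetical and are not taken from the study. Either original notes or note summaries can be passed as `notes`.

```python
def build_prompt(procedure, notes, question, examples=None, chain_of_thought=False):
    """Assemble a plain-text prompt.

    Hypothetical layout for illustration only; not the study's template.
    """
    parts = []
    # Few-shot prompting: prepend worked examples, if any are supplied.
    for ex in examples or []:
        parts.append(
            f"Example procedure: {ex['procedure']}\n"
            f"Example notes: {ex['notes']}\n"
            f"Example answer: {ex['answer']}"
        )
    parts.append(f"Procedure: {procedure}")
    # Notes may be the original text or summaries of it.
    parts.append("Notes:\n" + "\n---\n".join(notes))
    parts.append(f"Question: {question}")
    if chain_of_thought:
        # Chain-of-thought prompting: ask for reasoning before the answer.
        parts.append("Reason step by step, then state the final answer.")
    else:
        parts.append("State the final answer, then a brief explanation.")
    return "\n\n".join(parts)


# Toy, made-up inputs purely to show the call shape.
prompt = build_prompt(
    procedure="Laparoscopic cholecystectomy",
    notes=["Preoperative clinic note text.", "Cardiology consult note text."],
    question="Will the patient require ICU admission? Answer yes or no.",
    examples=[{"procedure": "Knee arthroscopy", "notes": "Healthy adult.", "answer": "no"}],
    chain_of_thought=True,
)
```

The resulting string would then be sent to a language model; that call is omitted here because the abstract does not describe it.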

MAIN OUTCOMES AND MEASURES

F1 score for binary and categorical outcomes. Mean absolute error for numerical duration outcomes.
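For readers less familiar with these metrics, here is a minimal pure-Python sketch of how they are conventionally computed. The abstract does not say how the F1 score was averaged across classes for categorical outcomes such as ASA-PS, so the macro-averaging in `f1_macro` is an assumption. The function names and toy inputs are illustrative only.

```python
def f1_binary(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 (averaging scheme is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_binary(y_true, y_pred, positive=c) for c in labels) / len(labels)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values, not study data.
binary_f1 = f1_binary([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])   # TP=2, FP=1, FN=1
multi_f1 = f1_macro([1, 2, 2, 3], [1, 2, 3, 3])            # per-class F1: 1, 2/3, 2/3
mae_minutes = mean_absolute_error([30, 60, 90], [45, 50, 120])
```

A higher F1 score indicates better classification, whereas a lower mean absolute error indicates better duration prediction, so the two families of results are read in opposite directions.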

RESULTS

Study results were measured on task-specific datasets of 1000 cases each, except for unplanned admission (949 cases) and hospital mortality (576 cases). The best results for each task included an F1 score of 0.50 (95% CI, 0.47-0.53) for ASA-PS, 0.64 (95% CI, 0.61-0.67) for hospital admission, 0.81 (95% CI, 0.78-0.83) for ICU admission, 0.61 (95% CI, 0.58-0.64) for unplanned admission, and 0.86 (95% CI, 0.83-0.89) for hospital mortality prediction. Performance on duration prediction tasks was universally poor across all prompt strategies: the large language model achieved a mean absolute error of 49 minutes (95% CI, 46-51 minutes) for PACU phase 1 duration, 4.5 days (95% CI, 4.2-5.0 days) for hospital duration, and 1.1 days (95% CI, 0.9-1.3 days) for ICU duration prediction.

CONCLUSIONS AND RELEVANCE

Current general-domain large language models may assist clinicians in perioperative risk stratification on classification tasks but are inadequate for numerical duration predictions. Their ability to produce high-quality natural language explanations for the predictions may make them useful tools in clinical workflows and may be complementary to traditional risk prediction models.
