
Assessment of a zero-shot large language model in measuring documented goals-of-care discussions.

Authors

Lee Robert Y, Li Kevin S, Sibley James, Cohen Trevor, Lober William B, Dotolo Danae G, Kross Erin K

Publication

medRxiv. 2025 May 25:2025.05.23.25328115. doi: 10.1101/2025.05.23.25328115.

Abstract

IMPORTANCE

Goals-of-care (GOC) discussions and their documentation are an important process measure in palliative care. However, existing natural language processing (NLP) models for identifying GOC documentation require costly training data that do not transfer to other constructs of interest. Newer large language models (LLMs) hold promise for measuring linguistically complex constructs with little or no task-specific training.

OBJECTIVE

To evaluate the performance of a publicly available LLM with no task-specific training data (zero-shot prompting) for identifying EHR-documented GOC discussions.

DESIGN, SETTING, AND PARTICIPANTS

This diagnostic study compared performance in identifying electronic health record (EHR)-documented GOC discussions of two NLP models: Llama 3.3 using zero-shot prompting, and a task-specific BERT (Bidirectional Encoder Representations from Transformers)-based model trained on a corpus of 4,642 manually annotated notes. Models were evaluated using text corpora drawn from clinical trials enrolling adult patients with chronic life-limiting illness hospitalized at a US health system over 2018-2023.
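The zero-shot design described above supplies the model with only an outcome definition and the note text, with no labeled examples. A minimal sketch of this prompting pattern is shown below; the definition wording and prompt layout are illustrative assumptions, not the study's actual prompt.

```python
# Hypothetical sketch of zero-shot prompt construction for note-level
# classification. The definition text below is a placeholder, not the
# study's actual outcome definition.

GOC_DEFINITION = (
    "A goals-of-care discussion is a documented conversation addressing a "
    "patient's values, goals, and preferences regarding medical care."
)

def build_zero_shot_prompt(note_text: str) -> str:
    """Assemble a zero-shot prompt: outcome definition + note + yes/no question."""
    return (
        f"Definition: {GOC_DEFINITION}\n\n"
        f"Clinical note:\n{note_text}\n\n"
        "Question: Does this note document a goals-of-care discussion? "
        "Answer YES or NO."
    )
```

Because no example labels appear in the prompt, the model's only task supervision is the definition itself, which is what distinguishes this setup from the BERT comparator trained on 4,642 annotated notes.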

OUTCOMES AND MEASURES

The outcomes were NLP model performance, evaluated by the area under the Receiver Operating Characteristic curve (AUC), area under the precision-recall curve (AUPRC), and maximal F score. NLP performance was evaluated for both note-level and patient-level classification over a 30-day period.
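The three reported metrics can be computed directly from binary labels and classifier scores. The sketch below is a pure-Python illustration of AUC (via the rank-sum identity) and the maximal F score (threshold sweep); in practice a library such as scikit-learn would be used, and AUPRC is computed analogously over the precision-recall curve.

```python
# Illustrative implementations of two of the reported metrics,
# assuming labels in {0, 1} and real-valued classifier scores.

def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U (rank-sum) identity:
    the probability a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def max_f_score(labels, scores):
    """Maximal F1 score over all score thresholds."""
    best = 0.0
    for t in sorted(set(scores)):
        preds = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(preds, labels))
        fp = sum(p and not y for p, y in zip(preds, labels))
        fn = sum((not p) and y for p, y in zip(preds, labels))
        if tp:
            prec, rec = tp / (tp + fp), tp / (tp + fn)
            best = max(best, 2 * prec * rec / (prec + rec))
    return best
```

AUPRC is preferred alongside AUC here because GOC documentation is rare (<1% of EHR text), and precision-recall metrics are more informative than ROC metrics under heavy class imbalance.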

RESULTS

Across three text corpora, GOC documentation represented <1% of EHR text and was found in 7.3-9.9% of notes for 23-37% of patients. In a 617-patient held-out test set, Llama 3.3 (zero-shot) and BERT (task-specific, trained) exhibited comparable performance in identifying GOC documentation. Llama 3.3 identified GOC documentation with AUC 0.979, AUPRC 0.873, and F 0.83; BERT identified the same with AUC 0.981, AUPRC 0.874, and F 0.83. In examining the cumulative incidence of GOC documentation over the specified 30-day period, Llama 3.3 identified patients with GOC documentation with AUC 0.977, AUPRC 0.955, and F 0.89; and BERT identified the same with AUC 0.981, AUPRC 0.952, and F 0.89.

CONCLUSIONS AND RELEVANCE

A zero-shot large language model with no task-specific training performs similarly to a task-specific supervised-learning BERT model trained on thousands of manually labeled EHR notes in identifying documented goals-of-care discussions. These findings demonstrate promise for rigorous use of LLMs in measuring novel clinical trial outcomes.

KEY POINTS

Question: Can newer large language AI models accurately measure documented goals-of-care discussions without task-specific training data?

Findings: In this diagnostic/prognostic study, a publicly available large language model prompted with an outcome definition and no task-specific training demonstrated performance identifying documented goals-of-care discussions comparable to that of a previous deep-learning model trained on an annotated corpus of 4,642 notes.

Meaning: Natural language processing allows the measurement of previously inaccessible outcomes for clinical research. Compared with traditional natural language processing and machine learning methods, newer large language AI models allow investigators to measure novel outcomes without needing costly training data.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a0eb/12140542/fcf8c1be54a3/nihpp-2025.05.23.25328115v1-f0001.jpg
