Large Language Models to Identify Advance Care Planning in Patients With Advanced Cancer.

Author Information

Agaronnik Nicole D, Davis Joshua, Manz Christopher R, Tulsky James A, Lindvall Charlotta

Affiliations

Harvard Medical School (N.A., C.M., J.T., C.L.), Boston, Massachusetts, USA; Dana-Farber Cancer Institute (N.A., C.M., J.T., C.L.), Boston, Massachusetts, USA; Albany Medical College (J.D.), Albany, New York, USA.

Publication Information

J Pain Symptom Manage. 2025 Mar;69(3):243-250.e1. doi: 10.1016/j.jpainsymman.2024.11.016. Epub 2024 Nov 24.

Abstract

CONTEXT

Efficiently tracking Advance Care Planning (ACP) documentation in electronic health records (EHRs) is essential for quality improvement and research efforts. The use of large language models (LLMs) offers a novel approach to this task.

OBJECTIVES

To evaluate the ability of LLMs to identify ACP in EHRs for patients with advanced cancer and compare performance to gold-standard manual chart review and natural language processing (NLP).

METHODS

We analyzed EHRs from patients with advanced cancer followed at seven Dana-Farber Cancer Institute (DFCI) clinics in June 2024. We used GPT-4o-2024-05-13 within DFCI's HIPAA-secure digital infrastructure. We designed LLM prompts to identify four ACP domains: goals of care, limitation of life-sustaining treatment, hospice, and palliative care. We developed a novel hallucination index to measure the production of factually incorrect evidence by the LLM. Performance was compared to gold-standard manual chart review and NLP.
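As an illustration of how such domain-identification prompting might be implemented, the sketch below classifies a single clinical note against the four ACP domains through an OpenAI-compatible chat completions endpoint. The prompt wording, JSON output schema, and model settings are assumptions for illustration only, not the authors' actual prompts or secure pipeline.

```python
# Illustrative sketch only: the prompt text, output schema, and settings are assumptions,
# not the prompts or infrastructure described in the study.
import json
from openai import OpenAI

ACP_DOMAINS = [
    "goals of care",
    "limitation of life-sustaining treatment",
    "hospice",
    "palliative care",
]

SYSTEM_PROMPT = (
    "You review oncology clinical notes for advance care planning (ACP) documentation. "
    "For each of the following domains, state whether the note documents it and quote the "
    "supporting text verbatim: " + "; ".join(ACP_DOMAINS) + ". "
    'Respond with JSON: {"domains": [{"domain": str, "present": bool, "evidence": str}]}.'
)

# Assumes an API key in the environment; the study ran GPT-4o inside DFCI's
# HIPAA-secure digital infrastructure, which is not reproduced here.
client = OpenAI()

def classify_note(note_text: str) -> dict:
    """Ask the model which ACP domains a single note documents, with quoted evidence."""
    response = client.chat.completions.create(
        model="gpt-4o-2024-05-13",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": note_text},
        ],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    example = (
        "Discussed goals of care with patient and family; "
        "patient prefers comfort-focused care and declines further chemotherapy."
    )
    print(classify_note(example))
```

Returning quoted evidence alongside each domain label is what makes a hallucination check possible: the quoted span can be compared against the source note to flag fabricated support.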

RESULTS

Sixty unique patients associated with 528 notes were used to construct the gold-standard data set. LLM prompts had sensitivity ranging from 0.85 to 1.0, specificity ranging from 0.80 to 0.91, and accuracy ranging from 0.81 to 0.91 across domains. The LLM had better sensitivity than NLP for identifying complex topics such as goals of care. The average hallucination index for notes identified by the LLM was less than 0.5, indicating a low probability of hallucination. Despite lower precision compared to NLP, false positive documentation identified by the LLM was clinically relevant and useful for guiding management.
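For reference, the reported metrics follow their standard confusion-matrix definitions. The short sketch below uses made-up counts, not the study's data, to show how they are computed per domain.

```python
# Standard confusion-matrix metrics as reported in the abstract.
# The counts below are placeholders, not the study's data.

def metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    return {
        "sensitivity": tp / (tp + fn),               # share of true ACP notes found
        "specificity": tn / (tn + fp),               # share of non-ACP notes correctly ruled out
        "precision": tp / (tp + fp),                 # share of flagged notes that were truly ACP
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Hypothetical counts for one domain:
print(metrics(tp=45, fp=8, tn=40, fn=2))
```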

CONCLUSION

LLMs can capture ACP domains from EHRs, with sensitivity exceeding that of NLP methods for complex domains such as goals of care. Future studies should explore approaches for scaling this methodology.

