Identifying Deprescribing Opportunities With Large Language Models in Older Adults: Retrospective Cohort Study.

Author information

Socrates Vimig, Wright Donald S, Huang Thomas, Fereydooni Soraya, Dien Christine, Chi Ling, Albano Jesse, Patterson Brian, Sasidhar Kanaparthy Naga, Wright Catherine X, Loza Andrew, Chartash David, Iscoe Mark, Taylor Richard Andrew

Affiliations

Department of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT, United States.

Program of Computational Biology and Bioinformatics, Yale University, New Haven, CT, United States.

Publication information

JMIR Aging. 2025 Apr 11;8:e69504. doi: 10.2196/69504.

DOI:10.2196/69504

PMID:40215480

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12032504/

Abstract

BACKGROUND

Polypharmacy, the concurrent use of multiple medications, is prevalent among older adults and associated with increased risks for adverse drug events including falls. Deprescribing, the systematic process of discontinuing potentially inappropriate medications, aims to mitigate these risks. However, the practical application of deprescribing criteria in emergency settings remains limited due to time constraints and criteria complexity.

OBJECTIVE

This study aims to evaluate the performance of a large language model (LLM)-based pipeline in identifying deprescribing opportunities for older emergency department (ED) patients with polypharmacy, using 3 different sets of criteria: Beers, Screening Tool of Older People's Prescriptions, and Geriatric Emergency Medication Safety Recommendations. The study further evaluates LLM confidence calibration and its ability to improve recommendation performance.

METHODS

We conducted a retrospective cohort study of older adults presenting to an ED in a large academic medical center in the Northeast United States from January 2022 to March 2022. A random sample of 100 patients (712 total oral medications) was selected for detailed analysis. The LLM pipeline consisted of two steps: (1) filtering high-yield deprescribing criteria based on patients' medication lists, and (2) applying these criteria using both structured and unstructured patient data to recommend deprescribing. Model performance was assessed by comparing model recommendations to those of trained medical students, with discrepancies adjudicated by board-certified ED physicians. Selective prediction, a method that allows a model to abstain from low-confidence predictions to improve overall reliability, was applied to assess the model's confidence and decision-making thresholds.
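To make the selective prediction step concrete, the following is a minimal sketch, not the authors' implementation. It assumes each model output carries a self-reported confidence in [0, 1] and a flag saying whether it matched the reference label. The `Prediction` class, the `selective_accuracy` function, and the toy records are all hypothetical and are not study data.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    recommend: bool    # hypothetical: model recommends deprescribing
    confidence: float  # hypothetical: self-reported confidence in [0, 1]
    correct: bool      # whether the output matched the reference label


def selective_accuracy(preds, threshold):
    """Return (coverage, accuracy) when the model abstains below `threshold`.

    Coverage is the fraction of cases the model still answers;
    accuracy is computed only over those retained cases.
    """
    kept = [p for p in preds if p.confidence >= threshold]
    coverage = len(kept) / len(preds) if preds else 0.0
    accuracy = sum(p.correct for p in kept) / len(kept) if kept else float("nan")
    return coverage, accuracy


# Toy records for illustration only (not from the study).
toy = [
    Prediction(True, 0.9, True),
    Prediction(True, 0.8, False),
    Prediction(False, 0.6, True),
    Prediction(True, 0.4, False),
]

for t in (0.0, 0.5, 0.85):
    cov, acc = selective_accuracy(toy, t)
    print(f"threshold={t:.2f} coverage={cov:.2f} accuracy={acc:.2f}")
```

If confidence is well calibrated, accuracy on the retained cases should rise as the threshold increases and coverage falls. With poorly calibrated confidence, raising the threshold discards cases without a matching gain in accuracy.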

RESULTS

The LLM was significantly more effective in identifying deprescribing criteria (positive predictive value: 0.83; negative predictive value: 0.93; McNemar test for paired proportions: χ²=5.985; P=.02) relative to medical students, but showed limitations in making specific deprescribing recommendations (positive predictive value=0.47; negative predictive value=0.93). Adjudication revealed that while the model excelled at identifying when there was a deprescribing criterion related to one of the patient's medications, it often struggled with determining whether that criterion applied to the specific case due to complex inclusion and exclusion criteria (54.5% of errors) and ambiguous clinical contexts (eg, missing information; 39.3% of errors). Selective prediction only marginally improved LLM performance due to poorly calibrated confidence estimates.
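For readers less familiar with the metrics reported above, the sketch below shows how positive and negative predictive values and a McNemar statistic are typically computed from counts. It is illustrative only: the function names and all counts are hypothetical, and the abstract does not say whether the authors applied a continuity correction, so both variants are shown.

```python
def ppv_npv(tp, fp, tn, fn):
    """Positive and negative predictive value from confusion-matrix counts."""
    ppv = tp / (tp + fp)  # share of positive calls that were right
    npv = tn / (tn + fn)  # share of negative calls that were right
    return ppv, npv


def mcnemar_chi2(b, c, continuity=False):
    """McNemar chi-square statistic (1 degree of freedom) for paired raters.

    b: cases rater A got right and rater B got wrong
    c: cases rater A got wrong and rater B got right
    Only the discordant pairs enter the statistic.
    """
    diff = abs(b - c)
    if continuity:
        diff = max(diff - 1, 0)  # Edwards continuity correction
    return diff ** 2 / (b + c)


# Hypothetical counts, chosen only to show the arithmetic.
print(ppv_npv(tp=8, fp=2, tn=9, fn=1))
print(mcnemar_chi2(b=15, c=5))
print(mcnemar_chi2(b=15, c=5, continuity=True))
```

The statistic is compared against a chi-square distribution with 1 degree of freedom to obtain the P value.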

CONCLUSIONS

This study highlights the potential of LLMs to support deprescribing decisions in the ED by effectively filtering relevant criteria. However, challenges remain in applying these criteria to complex clinical scenarios, as the LLM demonstrated poor performance on more intricate decision-making tasks, with its reported confidence often failing to align with its actual success in these cases. The findings underscore the need for clearer deprescribing guidelines, improved LLM calibration for real-world use, and better integration of human-artificial intelligence workflows to balance artificial intelligence recommendations with clinician judgment.

