Predicting 30-Day Postoperative Mortality and American Society of Anesthesiologists Physical Status Using Retrieval-Augmented Large Language Models: Development and Validation Study.

Authors

Chen Ying-Hao, Ruan Shanq-Jang, Chen Pei-Fu

Affiliations

Department of Electronic and Computer Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan.

Department of Anesthesiology, Far Eastern Memorial Hospital, New Taipei City, Taiwan.

Publication

J Med Internet Res. 2025 Jun 3;27:e75052. doi: 10.2196/75052.

Abstract

BACKGROUND

Accurately assessing perioperative risk is critical for informed surgical planning and patient safety. However, current prediction models often rely on structured data and overlook the nuanced clinical reasoning embedded in free-text preoperative notes. Recent advances in large language models (LLMs) have opened opportunities for harnessing unstructured clinical data, yet their application in perioperative prediction remains limited by concerns about factual accuracy. Retrieval-augmented generation (RAG) offers a promising solution: it enhances LLM performance by grounding outputs in domain-specific knowledge sources, potentially improving both predictive accuracy and clinical interpretability.

OBJECTIVE

This study aimed to investigate whether integrating LLMs with RAG can improve the prediction of 30-day postoperative mortality and American Society of Anesthesiologists (ASA) physical status classification using unstructured preoperative clinical notes.

METHODS

We conducted a retrospective cohort study using 24,491 medical records from a tertiary medical center, including preoperative anesthesia assessments, discharge summaries, and surgical information. To extract clinical insights from free-text data, we used the LLaMA 3.1-8B language model with RAG, with MedEmbed for text embedding and Miller's Anesthesia as the primary retrieval source. We evaluated model performance under various configurations, varying the embedding model, the chunk size, and the use of few-shot prompting. Machine learning (ML) models, including random forest, support vector machines (SVM), Extreme Gradient Boosting (XGBoost), and logistic regression, were trained on structured features as baselines.
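The retrieval and few-shot steps described above can be sketched roughly as follows. This is a minimal illustration, not the study's pipeline: the bag-of-words `embed` function is a toy stand-in for an embedding model such as MedEmbed, the reference passages are placeholder strings rather than text from Miller's Anesthesia, and the names `chunk_text`, `retrieve`, and `build_prompt` are hypothetical.

```python
import math
from collections import Counter


def chunk_text(text, chunk_size):
    """Split a reference text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, top_k=2):
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(note, retrieved, examples):
    """Assemble retrieved passages, few-shot examples, and the target note into one prompt."""
    parts = ["Reference passages:"]
    parts += [f"- {passage}" for passage in retrieved]
    parts.append("Examples:")
    for ex_note, ex_label in examples:
        parts.append(f"Note: {ex_note}\nASA class: {ex_label}")
    parts.append(f"Note: {note}\nASA class:")
    return "\n".join(parts)


# Placeholder reference chunks and a made-up note, for illustration only.
chunks = ["heart failure raises risk", "knee surgery is common"]
note = "patient with heart failure"
prompt = build_prompt(note, retrieve(note, chunks, top_k=1), [("healthy adult", "I")])
```

In a setup like this, the chunk size controls how much reference text each retrieved passage carries, and dropping the `examples` argument or the retrieved passages corresponds to the "without few-shot prompting" and "without RAG" configurations compared in the results.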

RESULTS

A total of 520 (2.1%) patients died in hospital within 30 days after surgery. The ASA physical status distribution was as follows: class I: 535 (2.2%); class II: 15,272 (62.4%); class III: 8024 (32.8%); class IV: 606 (2.5%); and class V: 54 (0.22%). For 30-day postoperative mortality prediction, the LLaMA-RAG model achieved an F-score of 0.4663 (95% CI 0.4654-0.4672), versus 0.2369 (95% CI 0.2341-0.2397) without few-shot prompting, 0.0879 (95% CI 0.0717-0.1041) without RAG, and 0.0436 (95% CI 0.0292-0.0580) without either few-shot prompting or RAG. Among ML models, XGBoost scored 0.4459 (95% CI 0.4176-0.4742); random forest, 0.3953 (95% CI 0.3791-0.4115); logistic regression, 0.2720 (95% CI 0.2647-0.2793); and SVM, 0.2474 (95% CI 0.2275-0.2673). For ASA classification, LLaMA-RAG achieved a micro F-score of 0.8409 (95% CI 0.8238-0.8551), versus 0.6546 (95% CI 0.6430-0.6796) without few-shot prompting, 0.6340 (95% CI 0.6157-0.6535) without RAG, and 0.4238 (95% CI 0.3952-0.4490) without either few-shot prompting or RAG. In comparison, XGBoost achieved 0.8273 (95% CI 0.8209-0.8498); logistic regression, 0.7940 (95% CI 0.7671-0.7950); random forest, 0.7847 (95% CI 0.7637-0.7868); and SVM, 0.7697 (95% CI 0.7637-0.7697). Notably, the LLaMA-RAG model demonstrated exceptional sensitivity in identifying rare but high-risk cases, such as ASA class V patients and postoperative deaths.
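For readers unfamiliar with the metrics reported above, the following sketch shows how an F-score for a rare binary outcome and a micro-averaged F-score for a multiclass label can be computed. It assumes the F-score is the balanced F1 (the abstract does not state the beta value), and the label lists are invented toy data, not study records.

```python
def f1_binary(y_true, y_pred, positive=1):
    """F1 for one positive class: 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_micro(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP, and FN over all classes before computing F1."""
    pairs = list(zip(y_true, y_pred))
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        tp += sum(1 for t, p in pairs if t == label and p == label)
        fp += sum(1 for t, p in pairs if t != label and p == label)
        fn += sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy mortality labels (1 = death): one hit, one miss, one false alarm.
died_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
died_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
mortality_f1 = f1_binary(died_true, died_pred)

# A model that never predicts death is 80% accurate here but scores F1 = 0.
never_f1 = f1_binary(died_true, [0] * len(died_true))

# Toy ASA labels: 3 of 5 predictions are correct.
asa_true = ["II", "II", "III", "I", "IV"]
asa_pred = ["II", "III", "III", "I", "III"]
asa_micro_f1 = f1_micro(asa_true, asa_pred)
```

With a 2.1% event rate, a model that never predicts death would be highly accurate yet have an F-score of 0, which is one reason an F-score is more informative than accuracy for the mortality task. For single-label multiclass data such as ASA class, the micro-averaged F1 equals overall accuracy.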

CONCLUSIONS

The LLaMA-RAG model significantly improved the prediction of postoperative mortality and ASA classification, especially for rare high-risk cases. By grounding outputs in domain knowledge, RAG enhanced both accuracy and prompt-driven interpretability over ML and ablation models, highlighting its promise for real-world clinical decision support.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5bfc/12174870/9b7281cba3f2/jmir_v27i1e75052_fig1.jpg