Comparing large scale and selected feature learning for community acquired pneumonia prognosis prediction using clinical data: a stacked ensemble approach.

Author information

Lee Ji Hyun, Lee Hyun Woo, Lee Hyo Jin, Park Tae Yun, Jin Kwang Nam, Kim Dong Hyun, Ryu Borim

Affiliations

Department of Radiology, Seoul Metropolitan Government-Seoul National University Boramae Medical Center, 20, Boramae-ro 5-gil, Dongjak-gu, Seoul, Republic of Korea.

Division of Pulmonary and Critical Care Medicine, Department of Internal Medicine, Seoul National University College of Medicine, Seoul Metropolitan Government-Seoul National University Boramae Medical Center, 20, Boramae-ro 5-gil, Dongjak-gu, Seoul, Republic of Korea.

Publication information

Sci Rep. 2025 Apr 10;15(1):12319. doi: 10.1038/s41598-025-95941-8.

DOI:10.1038/s41598-025-95941-8

PMID:40210962

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11985930/

Abstract

This study investigated and validated models for predicting all-cause in-hospital death in hospitalized pneumonia patients, based on large-scale clinical data including diagnoses, medication prescriptions, and laboratory test codes. Two feature-selection strategies were compared: large-scale feature learning with a Common Data Model (CDM), and a set of specific pneumonia-related risk factors. A stacked ensemble mixed machine-learning model was compared with traditional machine-learning models. Performance was evaluated using accuracy, F1-score, the area under the precision-recall curve (AUPRC), and the area under the receiver operating characteristic curve (AUROC). For large-scale feature learning using a CDM, the ensemble model (LASSO LR + GBM + RF) achieved the highest performance: its AUROC was 0.867 (95% CI: 0.823-0.910) for the 365-day lookback and 0.867 (95% CI: 0.822-0.912) for the 7-day lookback. In contrast, for feature learning based on selected pneumonia risk factors, the RF model performed best among the traditional models, with AUROCs of 0.774 (95% CI: 0.717-0.830) for the 365-day lookback and 0.773 (95% CI: 0.717-0.828) for the 7-day lookback. Combining large-scale feature learning within the CDM with a stacked ensemble model yielded more accurate and robust predictions, highlighting its potential to capture complex relationships among clinical features and to improve prognostic assessment.
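The AUROC values above are reported with 95% confidence intervals. The abstract does not say how those intervals were obtained, so the following is only a minimal sketch of one common approach, a percentile bootstrap, and not the authors' procedure. The function names `auroc` and `bootstrap_ci`, the resample count, and the toy labels and scores are all illustrative assumptions; none of them come from the study.

```python
import random


def auroc(labels, scores):
    """AUROC via the rank-sum (Mann-Whitney U) identity.

    labels: list of 0/1 outcomes (1 = in-hospital death, hypothetically).
    scores: list of predicted risks; tied scores receive their average rank.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(labels)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the block of tied scores starting at position i.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j
    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_stat / (n_pos * n_neg)


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC (an assumed method)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[k] for k in idx]
        # Skip resamples that contain only one class.
        if 0 < sum(ys) < n:
            stats.append(auroc(ys, [scores[k] for k in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic toy data, not taken from the paper.
toy_labels = [0, 0, 1, 1, 0, 1, 0, 1, 0, 0]
toy_scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.30, 0.90, 0.15, 0.50]

point = auroc(toy_labels, toy_scores)
ci_low, ci_high = bootstrap_ci(toy_labels, toy_scores)
print(f"AUROC {point:.3f} (95% CI: {ci_low:.3f}-{ci_high:.3f})")
```

On a cohort-sized test set the same two calls would be applied to each model's predicted probabilities. Intervals that overlap almost entirely, as for the 365-day and 7-day lookbacks above, indicate that the data do not distinguish the two settings.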
