Lee Ji Hyun, Lee Hyun Woo, Lee Hyo Jin, Park Tae Yun, Jin Kwang Nam, Kim Dong Hyun, Ryu Borim
Department of Radiology, Seoul Metropolitan Government-Seoul National University Boramae Medical Center, 20, Boramae-ro 5-gil, Dongjak-gu, Seoul, Republic of Korea.
Division of Pulmonary and Critical Care Medicine, Department of Internal Medicine, Seoul National University College of Medicine, Seoul Metropolitan Government-Seoul National University Boramae Medical Center, 20, Boramae-ro 5-gil, Dongjak-gu, Seoul, Republic of Korea.
Sci Rep. 2025 Apr 10;15(1):12319. doi: 10.1038/s41598-025-95941-8.
This study investigated and validated all-cause in-hospital death prediction models for hospitalized pneumonia patients based on large-scale clinical data, including diagnoses, medication prescriptions, and laboratory test codes. Feature selection followed two approaches: large-scale feature learning with a Common Data Model (CDM), and a set of specific pneumonia-related risk factors. A stacked ensemble mixed machine-learning model was compared with traditional machine-learning models. Accuracy, F1-score, the Area Under the Precision-Recall Curve (AUPRC), and the Area Under the Receiver Operating Characteristic Curve (AUROC) were used for performance evaluation. For large-scale feature learning using a CDM, the ensemble model (LASSO LR + GBM + RF) achieved the highest performance, with an AUROC of 0.867 (95% CI: 0.823-0.910) for the 365-day lookback and 0.867 (95% CI: 0.822-0.912) for the 7-day lookback. In contrast, for feature learning based on selected pneumonia risk factors, the RF model performed best among the traditional models, with AUROCs of 0.774 (95% CI: 0.717-0.830) for the 365-day lookback and 0.773 (95% CI: 0.717-0.828) for the 7-day lookback. Combining large-scale feature learning within the CDM with a stacked ensemble model yielded more accurate and robust predictions, highlighting its potential to capture complex relationships among clinical features and improve prognostic assessments.
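As an illustration of how an AUROC and its 95% confidence interval of the kind reported above can be computed, the following is a minimal sketch in plain Python. The abstract does not state how the intervals were obtained; the percentile bootstrap used here is an assumption, and the function names (`auroc`, `bootstrap_ci`) and the toy labels and scores are hypothetical, not data from the study.

```python
import random


def auroc(labels, scores):
    """AUROC as the probability that a randomly chosen positive case
    scores higher than a randomly chosen negative case (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUROC (an assumed method,
    not necessarily the one used in the study)."""
    if len(set(labels)) < 2:
        raise ValueError("need both classes to bootstrap the AUROC")
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that contain only one class
        stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Hypothetical toy data: 1 = in-hospital death, 0 = survived.
labels = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7, 0.5, 0.9]

print("AUROC:", auroc(labels, scores))
print("95% CI:", bootstrap_ci(labels, scores))
```

With only eight cases the interval is very wide. The narrow intervals in the abstract reflect a much larger cohort.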