Balch Jeremy A, Ruppert Matthew M, Guan Ziyuan, Buchanan Timothy R, Abbott Kenneth L, Shickel Benjamin, Bihorac Azra, Liang Muxuan, Upchurch Gilbert R, Tignanelli Christopher J, Loftus Tyler J
Department of Surgery, University of Florida, Gainesville.
Department of Health Outcomes and Biomedical Informatics, University of Florida, Gainesville.
JAMA Surg. 2024 Dec 1;159(12):1424-1431. doi: 10.1001/jamasurg.2024.4299.
IMPORTANCE: Machine learning tools are increasingly deployed for risk prediction and clinical decision support in surgery. Class imbalance adversely impacts predictive performance, especially for low-incidence complications.
OBJECTIVE: To evaluate risk-prediction model performance when trained on risk-specific cohorts.
DESIGN, SETTING, AND PARTICIPANTS: This cross-sectional study, performed from February 2024 to July 2024, deployed a deep learning model that generated risk scores for common postoperative complications. A total of 109 445 inpatient operations performed at 2 University of Florida Health hospitals from June 1, 2014, to May 5, 2021, were examined.
EXPOSURES: The model was trained de novo on separate cohorts of high-risk, medium-risk, and low-risk Current Procedural Terminology (CPT) codes, defined empirically by the incidence of 5 postoperative complications: (1) in-hospital mortality; (2) prolonged intensive care unit (ICU) stay (≥48 hours); (3) prolonged mechanical ventilation (≥48 hours); (4) sepsis; and (5) acute kidney injury (AKI). Low-risk and high-risk cutoffs for each complication were defined by the lower-third and upper-third prevalence in the dataset, except for mortality, for which the cutoffs were set at 1% or less and greater than 3%, respectively.
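The cohort definition above can be sketched in a few lines. This is a minimal illustration only, assuming that the "lower-third" and "upper-third" cutoffs are tertile boundaries of per-code complication prevalence; the abstract does not give the exact computation. The code labels (`code_a` and so on), the prevalence values, and the function names are all hypothetical.

```python
from statistics import quantiles

def risk_class(prevalence, low_cut, high_cut):
    """Label a procedure code by complication prevalence.

    Low risk: prevalence at or below low_cut.
    High risk: prevalence strictly above high_cut.
    Everything in between is medium risk.
    """
    if prevalence <= low_cut:
        return "low"
    if prevalence > high_cut:
        return "high"
    return "medium"

def tertile_cutoffs(prevalences):
    """Return the two cut points that split the values into thirds (assumed method)."""
    low_cut, high_cut = quantiles(prevalences, n=3)
    return low_cut, high_cut

# Hypothetical per-code AKI prevalence (not data from the study).
aki_prevalence = {
    "code_a": 0.02, "code_b": 0.08, "code_c": 0.15,
    "code_d": 0.22, "code_e": 0.31, "code_f": 0.40,
}
aki_low, aki_high = tertile_cutoffs(list(aki_prevalence.values()))
aki_classes = {
    code: risk_class(p, aki_low, aki_high) for code, p in aki_prevalence.items()
}

# Mortality uses the fixed cutoffs stated in the abstract: <=1% low, >3% high.
mortality_prevalence = {
    "code_a": 0.005, "code_b": 0.01, "code_c": 0.02,
    "code_d": 0.03, "code_e": 0.045,
}
mortality_classes = {
    code: risk_class(p, 0.01, 0.03) for code, p in mortality_prevalence.items()
}
```

Under these assumptions, each complication yields its own partition of procedure codes, so the same code could fall into different risk classes for different outcomes.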
MAIN OUTCOMES AND MEASURES: Model performance metrics were assessed for each risk-specific cohort alongside the baseline model. Metrics included area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), F1 score, and accuracy for each model.
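To make the four metrics concrete, the following sketch computes them in plain Python on a small, hypothetical, imbalanced sample (2 events in 10 cases). It is not the study's evaluation code: AUPRC is approximated here by average precision, the 0.5 decision threshold is an assumption, and the helper names are illustrative. The AUPRC helper assumes no tied scores and at least one positive case.

```python
def auroc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Step-wise AUPRC estimate: mean precision at the rank of each true positive."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

def f1_and_accuracy(y_true, scores, threshold=0.5):
    """F1 score and accuracy after thresholding the risk scores."""
    pred = [int(s >= threshold) for s in scores]
    tp = sum(1 for p, y in zip(pred, y_true) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(pred, y_true) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(pred, y_true) if p == 0 and y == 1)
    tn = sum(1 for p, y in zip(pred, y_true) if p == 0 and y == 0)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    accuracy = (tp + tn) / len(y_true)
    return f1, accuracy

# Hypothetical labels and risk scores (not data from the study).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.60, 0.40, 0.55, 0.90]

auc = auroc(y_true, scores)
ap = average_precision(y_true, scores)
f1, acc = f1_and_accuracy(y_true, scores)

# Accuracy of a model that never predicts the event, for comparison.
always_negative_acc = sum(1 for y in y_true if y == 0) / len(y_true)
```

In this toy sample a model that never predicts the event still reaches 80% accuracy, which illustrates why precision-based metrics such as AUPRC and F1 tend to be more informative than accuracy or AUROC alone when complications are rare.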
RESULTS: A total of 109 445 inpatient operations were examined among patients treated at 2 University of Florida Health hospitals in Gainesville (77 921 procedures [71.2%]) and Jacksonville (31 524 procedures [28.8%]). Median (IQR) patient age was 58 (43-68) years, and median (IQR) Charlson Comorbidity Index score was 2 (0-4). Among the 109 445 operations, 55 646 (50.8%) involved male patients, and 66 495 (60.8%) were nonemergent inpatient operations. Training on the high-risk cohort had a variable impact on AUROC but significantly improved AUPRC (as assessed by nonoverlapping 95% confidence intervals) for predicting mortality (0.53; 95% CI, 0.43-0.64), AKI (0.61; 95% CI, 0.58-0.65), and prolonged ICU stay (0.91; 95% CI, 0.89-0.92). It also significantly improved F1 score for mortality (0.42; 95% CI, 0.36-0.49), prolonged mechanical ventilation (0.55; 95% CI, 0.52-0.58), sepsis (0.46; 95% CI, 0.43-0.49), and AKI (0.57; 95% CI, 0.54-0.59). After controlling for baseline model performance on high-risk cohorts, AUPRC increased significantly for in-hospital mortality only (0.53; 95% CI, 0.42-0.65 vs 0.29; 95% CI, 0.21-0.40).
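The significance criterion used above (nonoverlapping 95% confidence intervals) can be expressed as a simple check. The sketch below applies it to the one comparison for which the abstract reports both intervals, the mortality AUPRC of the risk-specific versus baseline model on the high-risk cohort. The function names are illustrative, and the second pair of intervals is hypothetical, included only to show an overlapping case.

```python
def intervals_overlap(ci_a, ci_b):
    """Return True if two (lower, upper) intervals share any value."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

def significantly_higher(ci_new, ci_base):
    """True when the new interval lies entirely above the baseline interval."""
    return ci_new[0] > ci_base[1]

# Mortality AUPRC intervals reported in the abstract.
risk_specific_ci = (0.42, 0.65)
baseline_ci = (0.21, 0.40)
mortality_improved = significantly_higher(risk_specific_ci, baseline_ci)

# Hypothetical pair of intervals that overlap (not values from the study).
hypothetical_new_ci = (0.58, 0.65)
hypothetical_base_ci = (0.55, 0.60)
hypothetical_improved = significantly_higher(hypothetical_new_ci, hypothetical_base_ci)
```

Here the lower bound of the risk-specific interval (0.42) exceeds the upper bound of the baseline interval (0.40), so the improvement meets the criterion. Requiring nonoverlap is generally a conservative test, so some real differences may not be flagged by it.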
CONCLUSIONS AND RELEVANCE: In this cross-sectional study, training separate models using a priori knowledge of procedure-specific risk classes improved performance on standard evaluation metrics, especially for low-prevalence complications such as in-hospital mortality. Used cautiously, this approach may represent an optimal training strategy for surgical risk-prediction models.