Department of Community Health and Epidemiology, Faculty of Medicine, Dalhousie University, 5790 University Avenue, Halifax, B3H 1V7, NS, Canada.
Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, Aurora, Colorado, 80045, USA.
BMC Med Res Methodol. 2021 Nov 27;21(1):267. doi: 10.1186/s12874-021-01441-4.
Coronavirus disease (COVID-19) presents an unprecedented threat to global health. Accurately predicting the mortality risk among infected individuals is crucial for prioritizing medical care and mitigating the burden on the healthcare system. The present study aimed to assess the predictive accuracy of machine learning methods for predicting COVID-19 mortality risk.
We compared the performance of classification tree, random forest (RF), extreme gradient boosting (XGBoost), logistic regression, generalized additive model (GAM) and linear discriminant analysis (LDA) in predicting the mortality risk among 49,216 COVID-19 positive cases in Toronto, Canada, reported from March 1 to December 10, 2020. We used repeated split-sample validation and k-steps-ahead forecasting validation. Predictive models were estimated on training samples, and the predictive accuracy of each method on the testing samples was assessed using the area under the receiver operating characteristic curve (AUC), the Brier score, the calibration intercept and the calibration slope.
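The four accuracy measures named above can be illustrated with a minimal pure-Python sketch. It is not the study's code: the function names, the toy data, and the choice to estimate the calibration intercept and slope jointly by regressing the outcome on the logit of the predicted probability are all assumptions made for illustration.

```python
import math

def auc(y_true, p_hat):
    """AUC as the probability that a random death is ranked above a random survivor."""
    pos = [p for y, p in zip(y_true, p_hat) if y == 1]
    neg = [p for y, p in zip(y_true, p_hat) if y == 0]
    # Count concordant pairs; tied predictions count as half a win.
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

def brier_score(y_true, p_hat):
    """Mean squared difference between predicted probability and observed outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_hat)) / len(y_true)

def calibration_intercept_slope(y_true, p_hat, n_iter=50, eps=1e-12):
    """Fit logit(P(y = 1)) = a + b * logit(p_hat) by Newton-Raphson.

    Assumption: intercept and slope are estimated jointly. Values near
    a = 0 and b = 1 indicate well-calibrated predictions.
    """
    clipped = [min(max(p, eps), 1 - eps) for p in p_hat]
    x = [math.log(p / (1 - p)) for p in clipped]
    a, b = 0.0, 1.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y_true):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = mu * (1.0 - mu)
            g0 += yi - mu            # score for the intercept
            g1 += (yi - mu) * xi     # score for the slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < 1e-10:
            break
    return a, b

# Toy testing sample (hypothetical values, not data from the study).
y_test = [0, 0, 1, 0, 1, 1, 0, 1]
p_test = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]

print("AUC:", auc(y_test, p_test))
print("Brier score:", brier_score(y_test, p_test))
print("Calibration (intercept, slope):", calibration_intercept_slope(y_test, p_test))
```

Under repeated split-sample validation, functions like these would be applied to each testing sample and the results summarized across repetitions. A higher AUC indicates better discrimination, while a lower Brier score indicates better overall accuracy.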
We found that XGBoost was highly discriminative, with an AUC of 0.9669, and outperformed conventional tree-based methods (classification tree and RF) in predicting COVID-19 mortality risk. Regression-based methods (logistic, GAM and LASSO) performed comparably to XGBoost, with slightly lower AUCs and higher Brier scores.
XGBoost offers superior performance over conventional tree-based methods and minor improvement over regression-based methods for predicting COVID-19 mortality risk in the study population.