Edelson Maxim, Kuo Tsung-Ting
UCSD Department of Computer Science and Engineering, University of California San Diego, La Jolla, California, USA.
UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, California, USA.
JAMIA Open. 2022 May 25;5(2):ooac036. doi: 10.1093/jamiaopen/ooac036. eCollection 2022 Jul.
Predicting coronavirus disease 2019 (COVID-19) mortality is critical for early-stage patient care and intervention. Existing studies have mainly built models on datasets of limited geographical range or size. In this study, we developed COVID-19 mortality prediction models on worldwide, large-scale "sparse" data and on a "dense" subset of the data.
We evaluated 6 classifiers: logistic regression (LR), support vector machine (SVM), random forest (RF), multilayer perceptron (MLP), AdaBoost (AB), and Naive Bayes (NB). We also conducted a temporal analysis and calibrated our models using isotonic regression.
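Isotonic calibration can be illustrated with a minimal, dependency-free sketch. It is not the study's implementation, and the function names (`pava`, `fit_isotonic_calibrator`, `calibrate`) and the toy scores are hypothetical. The sketch assumes binary outcome labels and distinct raw scores, and fits a non-decreasing map from raw classifier scores to observed outcome rates using the pool-adjacent-violators algorithm.

```python
import bisect
from typing import List, Sequence, Tuple


def pava(values: Sequence[float]) -> List[float]:
    """Pool-adjacent-violators: least-squares non-decreasing fit to `values`."""
    blocks: List[List[float]] = []  # each block stores [sum, count]
    for v in values:
        blocks.append([float(v), 1.0])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted: List[float] = []
    for s, n in blocks:
        fitted.extend([s / n] * int(n))
    return fitted


def fit_isotonic_calibrator(scores: Sequence[float],
                            labels: Sequence[int]) -> Tuple[List[float], List[float]]:
    """Sort cases by raw score and fit calibrated probabilities with PAVA.

    Simplification (assumption): tied scores are not pooled specially.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    xs = [scores[i] for i in order]
    ys = pava([labels[i] for i in order])
    return xs, ys


def calibrate(score: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Step-function lookup: use the fitted value at the nearest knot at or below `score`."""
    i = bisect.bisect_right(xs, score) - 1
    return ys[max(i, 0)]


# Toy data (hypothetical): raw scores from any classifier and observed outcomes.
raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
outcomes = [0, 1, 0, 0, 1, 1]
knots, probs = fit_isotonic_calibrator(raw_scores, outcomes)
calibrated = calibrate(0.35, knots, probs)
```

In practice the calibrator would be fitted on held-out data rather than on the data used to train the classifier, so that the calibrated probabilities are not overly optimistic.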
The results showed that AB outperformed the other classifiers on the sparse dataset, while LR performed best on the dense dataset (area under the receiver operating characteristic curve, or AUC ≈ 0.7 for the sparse dataset and AUC = 0.963 for the dense one). We also identified impactful features such as symptoms, countries, age, and the date of death/discharge. All our models are well-calibrated (P > .1).
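For readers less familiar with the discrimination metric reported here, AUC can be read as the probability that a randomly chosen patient who died receives a higher predicted score than a randomly chosen survivor. The following is a minimal sketch of that pairwise definition, not the study's code; the function name `roc_auc` and the toy labels and scores are hypothetical.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical data): 1 = died, 0 = survived.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)  # 3 of 4 pairs are ordered correctly: 0.75
```

The pairwise loop is quadratic in the number of cases, which is fine for illustration; for large datasets a rank-based formulation or a library routine would be the usual choice.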
Our results highlight the tradeoff between training on sparse data, which increases generalizability, and training on denser data, which yields higher discrimination. We found that covariates such as patient symptoms, country (where the case was reported), age, and the date of discharge from the hospital or death were the most important for mortality prediction.
This study is a stepping-stone towards improving healthcare quality during the COVID-19 era and potentially other pandemics. Our code is publicly available at: https://doi.org/10.5281/zenodo.6336231.