AlJame Maryam, Ahmad Imtiaz, Imtiaz Ayyub, Mohammed Ameer
Computer Engineering Department, Kuwait University, Kuwait.
College of Medicine, Kuwait University, Kuwait.
Inform Med Unlocked. 2020;21:100449. doi: 10.1016/j.imu.2020.100449. Epub 2020 Oct 20.
The pandemic of novel coronavirus disease 2019 (COVID-19) has severely impacted human society with a massive death toll worldwide. There is an urgent need for early and reliable screening of COVID-19 patients to provide better and timely patient care and to combat the spread of the disease. In this context, recent studies have reported some key advantages of using routine blood tests for initial screening of COVID-19 patients. In this article, we first present a review of the emerging techniques for COVID-19 diagnosis using routine laboratory and/or clinical data. We then propose ERLX, an ensemble learning model for COVID-19 diagnosis from routine blood tests.
The proposed model uses three well-known, diverse classifiers at the first level (extra trees, random forest and logistic regression), which have different architectures and learning characteristics, and then combines their predictions with a second-level extreme gradient boosting (XGBoost) classifier to achieve better performance. For data preparation, the proposed methodology employs the KNNImputer algorithm to handle null values in the dataset, isolation forest (iForest) to remove outlier data, and the synthetic minority oversampling technique (SMOTE) to balance the data distribution. For model interpretability, feature importance is reported using the SHapley Additive exPlanations (SHAP) technique.
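The two-level design described above can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic, all hyperparameters are arbitrary, scikit-learn's `GradientBoostingClassifier` stands in for XGBoost so that the sketch needs no extra package, and the SMOTE and SHAP steps are left out because they live in separate libraries (imbalanced-learn and shap).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    IsolationForest,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption: not the hospital dataset),
# with roughly 10% of the entries blanked out to mimic missing lab values.
rng = np.random.default_rng(0)
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=0
)
X[rng.random(X.shape) < 0.1] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Step 1: fill null values from the nearest neighbours (fit on training data only).
imputer = KNNImputer(n_neighbors=5).fit(X_train)
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

# Step 2: drop training rows that the isolation forest flags as outliers (-1).
inlier = IsolationForest(random_state=0).fit_predict(X_train) == 1
X_train, y_train = X_train[inlier], y_train[inlier]

# Step 3: level-one classifiers with different learning characteristics.
level_one = [
    ("et", ExtraTreesClassifier(n_estimators=50, random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("lr", LogisticRegression(max_iter=1000)),
]

# Step 4: a boosted-tree meta-learner combines the level-one predictions.
# GradientBoostingClassifier is a stand-in here; the source uses XGBoost.
model = StackingClassifier(
    estimators=level_one,
    final_estimator=GradientBoostingClassifier(random_state=0),
    cv=5,
)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
```

`StackingClassifier` trains the meta-learner on cross-validated predictions of the level-one models, which keeps the second level from simply memorising first-level training outputs. The source does not state the order of the preparation steps or how the data were split, so the ordering shown here is an assumption.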
The proposed model was trained and evaluated on a publicly available dataset from Albert Einstein Hospital in Brazil, which consisted of 5644 data samples, 559 of which were confirmed COVID-19 cases. The ensemble model achieved outstanding performance, with an overall accuracy of 99.88% [95% CI: 99.6-100], an AUC of 99.38% [95% CI: 97.5-100], a sensitivity of 98.72% [95% CI: 94.6-100] and a specificity of 99.99% [95% CI: 99.99-100].
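For readers less familiar with the screening metrics quoted above, the following sketch shows how accuracy, sensitivity and specificity follow from confusion-matrix counts, together with one common way to attach a 95% interval to a proportion. The counts are hypothetical and are not taken from the study, and the Wilson score interval is an assumption: the abstract does not say how its confidence intervals were computed.

```python
from math import sqrt


def screening_metrics(tp, fn, tn, fp):
    """Return accuracy, sensitivity and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),  # share of true cases that are detected
        "specificity": tn / (tn + fp),  # share of non-cases that are cleared
    }


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives about 95%)."""
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical counts for illustration only.
tp, fn, tn, fp = 90, 10, 880, 20
metrics = screening_metrics(tp, fn, tn, fp)
sens_low, sens_high = wilson_interval(tp, tp + fn)

print(metrics)
print(f"sensitivity 95% interval: {sens_low:.3f} to {sens_high:.3f}")
```

On an imbalanced dataset such as the one described, accuracy is dominated by the majority class, which is why sensitivity and specificity are reported alongside it.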
When evaluated on the same feature sets used in existing state-of-the-art studies (Banerjee et al., 2020; de Freitas Barbosa et al., 2020; de Moraes Batista et al., 2020; Soares et al., 2020) [3,22,56,71], the proposed model outperformed each of them. Against the best-performing Bayes Net model (de Freitas Barbosa et al., 2020) [22], which reported an average accuracy of 95.159%, ERLX achieved an average accuracy of 99.94%. Against the SVM model (de Moraes Batista et al., 2020) [56], which reported an AUC of 85%, ERLX obtained an AUC of 99.77%, along with improvements in sensitivity and specificity. Against the ER-COV model (Soares et al., 2020) [71], which reported an average sensitivity of 70.25% and a specificity of 85.98%, ERLX achieved a sensitivity of 99.47% and a specificity of 99.99%. ERLX also scored considerably higher than the ANN model (Banerjee et al., 2020) [3] on all performance metrics. Therefore, the model presented is robust and can be deployed for reliable early and rapid screening of COVID-19 patients.