eXtreme Gradient Boosting-based method to classify patients with COVID-19.

Author information

Ramón Antonio, Torres Ana Maria, Milara Javier, Cascón Joaquín, Blasco Pilar, Mateo Jorge

Affiliations

Pharmacy Department, General University Hospital Consortium of Valencia, Valencia, Spain.

Institute of Technology, Universidad de Castilla-La Mancha, Cuenca, Spain.

Publication information

J Investig Med. 2022 Jul 18. doi: 10.1136/jim-2021-002278.

DOI:10.1136/jim-2021-002278

PMID:35850970

Abstract

Various demographic, clinical and laboratory variables have been associated with severity and mortality following SARS-CoV-2 infection. Most studies have applied traditional statistical methods, in some cases combined with a machine learning (ML) method. This is the first study to date to comparatively analyze five ML methods in order to select the one that most accurately predicts mortality in patients admitted with COVID-19. The aim of this single-center observational study is to classify, based on different types of variables, adult patients with COVID-19 at increased risk of mortality. SARS-CoV-2 infection was defined by a positive reverse transcriptase PCR. A total of 203 patients were admitted to a tertiary hospital between March 15 and June 15, 2020. Data were extracted from the electronic medical record. Four supervised ML algorithms (k-nearest neighbors (KNN), decision tree (DT), Gaussian naïve Bayes (GNB) and support vector machine (SVM)) were compared with the proposed eXtreme Gradient Boosting (XGB) method, which is reported to offer excellent scalability and high running speed, among other qualities. The results indicate that the XGB method has the best prediction accuracy (92%), with high precision (>0.92) and high recall (>0.92). The KNN, SVM and DT approaches show moderate prediction accuracy (>80%), moderate recall (>0.80) and moderate precision (>0.80). The GNB algorithm shows relatively low classification performance. The variables with the greatest weight in predicting mortality were C reactive protein, procalcitonin, glutamic oxaloacetic transaminase, glutamic pyruvic transaminase, neutrophils, D-dimer, creatinine, lactic acid, ferritin, days of non-invasive ventilation, septic shock and age. Based on these results, XGB is a solid candidate for the correct classification of patients with COVID-19.
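
The abstract reports accuracy, precision and recall for each classifier without defining them. As a minimal sketch, assuming a binary outcome coded 1 for death and 0 for survival, the three metrics can be computed from the confusion-matrix counts as shown below. The function name `binary_metrics` and the toy labels are hypothetical and invented purely for illustration; they are not data or code from the study.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision and recall for a binary outcome (1 = positive class)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Confusion-matrix counts.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    total = tp + tn + fp + fn
    return {
        # Share of all patients classified correctly.
        "accuracy": (tp + tn) / total if total else 0.0,
        # Share of predicted positives that are true positives.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Share of actual positives that the model identifies.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy labels, invented for illustration only (not study data).
y_true = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1, 0, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In a mortality setting, recall is often the metric to watch most closely, since a false negative corresponds to a high-risk patient the model fails to flag.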