Cátedras Conacyt, National Council on Science and Technology, Mexico City, Mexico.
Center for Research in Geospatial Information Sciences, Mexico City, Mexico.
Front Public Health. 2022 Jun 30;10:912099. doi: 10.3389/fpubh.2022.912099. eCollection 2022.
The rapid, exponential increase in COVID-19 infections and their catastrophic effects on patients' health have required the development of tools that support health systems in the quick and efficient diagnosis and prognosis of this disease. In this context, the present study aims to identify potential factors associated with COVID-19 infection by applying machine learning techniques. Random forest, chi-squared, XGBoost, and rpart were used for feature selection, and ROSE and SMOTE were used as resampling methods to address class imbalance. Machine and deep learning algorithms such as support vector machines, C4.5, random forest, rpart, and deep neural networks were then explored during the train/test phase to select the best prediction model. The dataset contains clinical data, anthropometric measurements, and other health parameters related to smoking habits, alcohol consumption, quality of sleep, physical activity, and health status during confinement due to the COVID-19 pandemic. The results showed that XGBoost identified the best set of features associated with COVID-19 infection, and that random forest yielded the best predictive model, with a balanced accuracy of 90.41% using SMOTE as the resampling technique. The best-performing model provides a tool to help prevent SARS-CoV-2 infection, since it identifies the variables that carry the highest risk, some of which are, to a certain extent, controllable.
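Two ideas in the abstract, balanced accuracy and SMOTE-style resampling, can be illustrated with a minimal sketch. This is not the authors' code: the function names `balanced_accuracy` and `smote_like_oversample`, the toy data, and the parameter `k` are hypothetical, and the oversampler is a simplified stand-in for SMOTE (it interpolates between a minority sample and one of its nearest minority neighbours) rather than a full implementation.

```python
import random


def balanced_accuracy(y_true, y_pred, positive=1):
    """Mean of sensitivity and specificity for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    return (sensitivity + specificity) / 2


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority points (simplified SMOTE-style sketch).

    Each new point lies on the segment between a randomly chosen minority
    sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Toy imbalanced labels (assumed data): 9 negatives, 1 positive.
y_true = [0] * 9 + [1]
y_pred = [0] * 10  # a classifier that always predicts the majority class

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_accuracy = balanced_accuracy(y_true, y_pred)

# Toy minority-class feature vectors (assumed data).
minority_points = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0)]
new_points = smote_like_oversample(minority_points, n_new=5)
```

On the toy labels, the always-negative classifier reaches a plain accuracy of 0.9 but a balanced accuracy of only 0.5, which shows why a metric that averages per-class recall is the more informative choice when classes are imbalanced.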