Jožef Stefan Institute, 1000 Ljubljana, Slovenia.
Int J Environ Res Public Health. 2021 Jun 23;18(13):6750. doi: 10.3390/ijerph18136750.
The COVID-19 pandemic affected the whole world, but not all countries were affected equally. This raises the question of which factors can explain the faster initial spread in some countries than in others. Many such factors are overshadowed by the effect of countermeasures, so we studied the early phases of the infection, before countermeasures had been put in place. For this task, we collected the most diverse dataset of potentially relevant factors and infection metrics to date. Using it, we show the importance of different factors and factor categories, as determined by both statistical methods and machine learning (ML) feature selection (FS) approaches. Factors related to culture (e.g., individualism, openness), development, and travel proved the most important. We then carried out a more thorough factor analysis using a novel rule discovery algorithm. We also show how interconnected these factors are and caution against relying on ML analysis in isolation. Importantly, we explore potential pitfalls in the methodology of similar work and demonstrate their impact on COVID-19 data analysis. Our best models, which use a decision tree classifier, predict the infection class with roughly 80% accuracy.