Department for Medical Statistics and Informatics, School of Medicine, University of Belgrade, Serbia.
Institute of Physics Belgrade, National Institute of the Republic of Serbia, University of Belgrade, Serbia.
Environ Res. 2021 Oct;201:111526. doi: 10.1016/j.envres.2021.111526. Epub 2021 Jun 24.
Many studies have proposed a relationship between COVID-19 transmissibility and ambient pollution levels. However, a major difficulty in establishing such associations is adequately accounting for complex disease dynamics, which are influenced by, for example, significant differences in control measures and testing policies. Another difficulty is appropriately controlling for the effects of other potentially important factors, given both their mutual correlations and the limited dataset. To overcome these difficulties, we use the basic reproduction number (R0), which we estimate for USA states using non-linear dynamics methods. To handle a large number of predictors, many of which are strongly mutually correlated, in combination with a limited dataset, we employ machine-learning methods. Specifically, to reduce dimensionality without complicating the interpretation of variables, we apply Principal Component Analysis to subsets of mutually related (and correlated) predictors. We then use methods that allow feature (predictor) selection and importance ranking, including both linear regressions with regularization and feature selection (Lasso and Elastic Net) and non-parametric methods based on ensembles of weak learners (Random Forest and Gradient Boost). Through these substantially different approaches, we robustly find that PM2.5 is a major predictor of R0 in USA states, with corrections from factors such as other pollutants, prosperity measures, population density, chronic disease levels, and possibly racial composition. As a rough magnitude estimate, we find that the relative change in R0 across the range of pollution levels observed in the USA is typically ~30%, which further underscores the importance of pollution in COVID-19 transmissibility.
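The workflow described above (principal components computed within groups of related predictors, followed by regularized linear models and tree ensembles that rank predictor importance) can be sketched roughly as follows. This is a minimal illustration on synthetic data using scikit-learn, not the authors' code: the predictor groups, variable names, sample size, effect sizes, and hyperparameters are all assumptions made for the example, and the response is constructed to depend on the pollution group.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNetCV, LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_states = 50  # hypothetical sample size, one row per state

# Synthetic stand-ins for real predictors (all names and values are made up).
latent_pollution = rng.normal(size=n_states)
latent_prosperity = rng.normal(size=n_states)
groups = {
    "pollution": np.column_stack(
        [latent_pollution + 0.3 * rng.normal(size=n_states) for _ in range(3)]
    ),
    "prosperity": np.column_stack(
        [latent_prosperity + 0.3 * rng.normal(size=n_states) for _ in range(3)]
    ),
}
density = rng.normal(size=n_states)

# Synthetic response: depends mostly on pollution by construction.
r0 = 3.0 + 0.5 * latent_pollution + 0.1 * density + 0.2 * rng.normal(size=n_states)


def first_pc(block):
    """Standardize a block of related predictors and return its first principal component."""
    scaled = StandardScaler().fit_transform(block)
    return PCA(n_components=1).fit_transform(scaled)[:, 0]


# One component per correlated group keeps each feature interpretable.
feature_names = ["pollution_pc1", "prosperity_pc1", "density"]
X = np.column_stack(
    [first_pc(groups["pollution"]), first_pc(groups["prosperity"]), density]
)
X = StandardScaler().fit_transform(X)

models = {
    "lasso": LassoCV(cv=5, random_state=0),
    "elastic_net": ElasticNetCV(cv=5, l1_ratio=0.5, random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "gradient_boost": GradientBoostingRegressor(random_state=0),
}

importance = {}
for name, model in models.items():
    model.fit(X, r0)
    # Linear models: absolute standardized coefficients; ensembles: impurity-based importances.
    raw = np.abs(model.coef_) if hasattr(model, "coef_") else model.feature_importances_
    importance[name] = dict(zip(feature_names, raw / raw.sum()))

for name, scores in importance.items():
    ranked = sorted(scores, key=scores.get, reverse=True)
    print(name, ranked)
```

Agreement in the ranking across all four models is what lends robustness to a conclusion of this kind, since the linear and tree-based methods make very different assumptions about how predictors enter the response.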