Department of Cancer Control and Population Health, Graduate School of Cancer Science and Policy, National Cancer Center, Goyang-si, Gyeonggi-do, Korea.
Department of Environmental and Occupational Health Sciences, University of Washington, Seattle, WA, United States of America.
PLoS One. 2020 Feb 18;15(2):e0228535. doi: 10.1371/journal.pone.0228535. eCollection 2020.
National-scale empirical models for air pollution can include hundreds of geographic variables. The impact of model parsimony (i.e., how model performance differs for a large versus small number of covariates) has not been systematically explored. We aim to (1) build annual-average integrated empirical geographic (IEG) regression models for the contiguous U.S. for six criteria pollutants during 1979-2015; (2) systematically explore how the number of variables selected for inclusion in a model affects model performance; and (3) provide publicly available model predictions. We compute annual-average concentrations from regulatory monitoring data for PM10, PM2.5, NO2, SO2, CO, and ozone at all monitoring sites for 1979-2015. We also use ~350 geographic characteristics at each location, including measures of traffic, land use, land cover, and satellite-based estimates of air pollution. We then develop IEG models, employing universal kriging and summary factors of the geographic variables estimated by partial least squares (PLS). For all pollutants and years, we compare three approaches for choosing variables to include in the PLS model: (1) no variables, (2) a limited number of variables selected from the full set by forward selection, and (3) all variables. We evaluate model performance by 10-fold cross-validation (CV) with conventional and spatially clustered test data. Models using 3 to 30 variables selected from the full set generally have the best performance across all pollutants and years (median R2 conventional [clustered] CV: 0.66 [0.47]) compared to models with no variables (0.37 [0]) or all variables (0.64 [0.27]). Concentration estimates for all Census Blocks reveal generally decreasing concentrations over several decades, with local heterogeneity. Our findings suggest that national prediction models can be built by empirically selecting only a small number of important variables to provide robust concentration estimates. Model estimates are freely available online.
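The difference between conventional and spatially clustered 10-fold CV can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the function names are hypothetical, the clustered folds are formed by a crude longitude-band split (the abstract does not say how clusters were defined), and the R2 is an MSE-based version truncated at zero (the abstract does not give the formula).

```python
import random


def conventional_folds(n_sites, k=10, seed=0):
    """Randomly assign each site to one of k folds of near-equal size."""
    order = list(range(n_sites))
    random.Random(seed).shuffle(order)
    folds = [0] * n_sites
    for rank, site in enumerate(order):
        folds[site] = rank % k
    return folds


def clustered_folds(coords, k=10):
    """Assign spatially contiguous groups of sites to the same fold.

    Assumption: sites are sorted by longitude (first coordinate) and cut
    into k consecutive bands. This is only a stand-in for a real spatial
    clustering; it ensures held-out sites lie far from most training sites.
    """
    n = len(coords)
    order = sorted(range(n), key=lambda i: coords[i][0])
    folds = [0] * n
    for rank, site in enumerate(order):
        folds[site] = rank * k // n
    return folds


def cv_r2(observed, predicted):
    """MSE-based R2, truncated at zero (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return max(0.0, 1.0 - sse / sst)
```

In use, a model would be refit once per fold on the remaining sites, the held-out predictions pooled, and `cv_r2` computed on the pooled pairs. Because clustered folds remove all nearby monitors from the training set, they test how well a model extrapolates to unmonitored regions, which is consistent with the lower clustered-CV R2 values reported above.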