Environmental and Occupational Health Sciences Institute (EOHSI), Rutgers University, Piscataway, NJ 08854, USA; Department of Chemical and Biochemical Engineering, Rutgers University, Piscataway, NJ 08854, USA.
Environmental and Occupational Health Sciences Institute (EOHSI), Rutgers University, Piscataway, NJ 08854, USA; Department of Environmental Sciences, Rutgers University, New Brunswick, NJ 08901, USA.
Environ Int. 2020 Sep;142:105827. doi: 10.1016/j.envint.2020.105827. Epub 2020 Jun 25.
Spatial linear Land-Use Regression (LUR) is commonly used for long-term modeling of air pollution in support of exposure and epidemiological assessments. Machine Learning (ML) methods in conjunction with spatiotemporal modeling can provide more flexible exposure-relevant metrics and have been studied using different model structures. There is, however, a lack of comparisons of the methods available within these two modeling frameworks that could guide model/algorithm selection in air quality epidemiology.
The present study compares thirteen algorithms for spatial/spatiotemporal modeling, applied to daily maxima of 8-hour running averages of ambient ozone concentrations at spatial resolutions corresponding to census tracts, to support estimation of annual ozone design values across the contiguous US. These algorithms were selected from nine representative categories and trained using predictors that included chemistry-transport model predictions, meteorological factors, land use and land cover, and stationary and mobile emissions.
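The modeled quantity can be illustrated with a minimal Python sketch. It computes the daily maximum of 8-hour running averages from one day of hourly ozone values, then derives a design value. The design-value step assumes the common US regulatory convention (annual fourth-highest daily maximum, averaged over three consecutive years); the abstract does not state the study's exact calculation. The function names are hypothetical, and regulatory details such as data-completeness rules, truncation, and windows that extend into the next day are deliberately omitted.

```python
from typing import List, Sequence


def running_8h_means(hourly: Sequence[float]) -> List[float]:
    """Mean of every 8-hour window that lies fully inside `hourly`."""
    window = 8
    return [
        sum(hourly[i:i + window]) / window
        for i in range(len(hourly) - window + 1)
    ]


def daily_max_8h(hourly: Sequence[float]) -> float:
    """Daily maximum of 8-hour running averages for one day of hourly data.

    Simplification (assumption): only windows contained in the supplied
    hours are considered.
    """
    if len(hourly) < 8:
        raise ValueError("need at least 8 hourly values")
    return max(running_8h_means(hourly))


def annual_fourth_highest(daily_maxima: Sequence[float]) -> float:
    """Fourth-highest daily maximum within one year."""
    if len(daily_maxima) < 4:
        raise ValueError("need at least 4 daily values")
    return sorted(daily_maxima, reverse=True)[3]


def design_value(fourth_highest_by_year: Sequence[float]) -> float:
    """Average of three consecutive annual fourth-highest values (assumed convention)."""
    if len(fourth_highest_by_year) != 3:
        raise ValueError("expected exactly three annual values")
    return sum(fourth_highest_by_year) / 3


if __name__ == "__main__":
    # Synthetic day: low overnight, an 8-hour afternoon plateau, low evening.
    day = [20.0] * 8 + [60.0] * 8 + [20.0] * 8
    print(daily_max_8h(day))
```

Because the design value depends only on the few highest days of each year, errors on peak days carry far more weight than errors on typical days, which is why predictive accuracy at high concentrations matters for this application.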
To obtain the best predictive performance, model structures were optimized through a repeated coarse/fine grid search guided by expert knowledge. Six target-oriented validation strategies were used to prevent overfitting and avoid over-optimistic model evaluation results. To take full advantage of the different algorithms, we introduced tuning of sample weights in spatiotemporal modeling to ensure predictive accuracy for peak concentrations, which is crucial for exposure assessments. In spatial modeling, four interpretation and visualization tools were introduced to explain predictions from the different algorithms.
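Two of these ideas can be sketched in a few lines of Python. The abstract does not describe the six validation strategies or the weighting scheme, so everything here is an illustrative assumption: a leave-location-out split (all observations from a monitoring site fall in the same fold) stands in for one possible target-oriented strategy, and a simple threshold rule stands in for the sample weights. The names `threshold` and `peak_weight` are hypothetical; in practice `peak_weight` would be treated as one more hyperparameter to tune.

```python
import math
from typing import List, Sequence, Tuple


def leave_location_out_folds(
    site_ids: Sequence[str], k: int
) -> List[Tuple[List[int], List[int]]]:
    """Split row indices into k (train, test) pairs, grouping rows by site.

    No site contributes rows to both sides of a split, so the test error
    reflects prediction at unmonitored locations.
    """
    unique_sites = sorted(set(site_ids))
    fold_of_site = {site: i % k for i, site in enumerate(unique_sites)}
    folds = []
    for fold in range(k):
        test = [i for i, s in enumerate(site_ids) if fold_of_site[s] == fold]
        train = [i for i, s in enumerate(site_ids) if fold_of_site[s] != fold]
        folds.append((train, test))
    return folds


def peak_sample_weights(
    y: Sequence[float], threshold: float, peak_weight: float
) -> List[float]:
    """Give observations at or above `threshold` a larger weight (assumed scheme)."""
    return [peak_weight if value >= threshold else 1.0 for value in y]


def weighted_rmse(
    y_true: Sequence[float], y_pred: Sequence[float], weights: Sequence[float]
) -> float:
    """Root of the weight-normalized mean squared error."""
    squared = sum(w * (t - p) ** 2 for t, p, w in zip(y_true, y_pred, weights))
    return math.sqrt(squared / sum(weights))


if __name__ == "__main__":
    sites = ["A", "A", "B", "B", "C", "C"]
    for train_idx, test_idx in leave_location_out_folds(sites, k=3):
        print(train_idx, test_idx)
```

Weights of this kind can be passed to any learner that accepts per-sample weights during fitting, so that errors on high-concentration days count more in the training loss. Scoring each candidate weight on held-out sites rather than held-out rows keeps the comparison from being flattered by temporal autocorrelation within a site.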
Nonlinear ML methods achieved higher prediction accuracy than linear LUR, and the improvements were larger for spatiotemporal modeling (a decrease of nearly 10%-40% in predicted RMSE). By tuning the sample weights, spatiotemporal models can predict the concentrations used to calculate ozone design values with accuracy comparable to, or even better than, that of spatial models (a decrease of nearly 30% in cross-validated RMSE). We visualized the underlying nonlinear relationships, heterogeneous associations and complex interactions captured by the two best-performing ML algorithms, i.e., Random Forest and Extreme Gradient Boosting, and found that these complex patterns mattered relatively less for model accuracy in spatial modeling.
Machine Learning can provide estimates that are more interpretable and practical than those from linear regression, improving accuracy in modeling human exposures. Careful design of hyperparameter tuning, together with flexible data splitting and validation, is crucial to obtaining reliable and stable results. Desirable/successful nonlinear models are expected to capture similar nonlinear patterns and interactions across different ML algorithms.