Environ Sci Technol. 2018 Nov 20;52(22):13260-13269. doi: 10.1021/acs.est.8b02917. Epub 2018 Nov 1.
The long satellite aerosol data record enables assessment of historical PM2.5 levels in regions where routine PM2.5 monitoring began only recently. However, most previous models reported decreased prediction accuracy when predicting PM2.5 levels outside the model-training period. In this study, we proposed an ensemble machine learning approach that provided reliable PM2.5 hindcast capabilities. Missing satellite data were first filled by multiple imputation. The modeling domain, China, was then divided into seven regions using a spatial clustering method to control for unobserved spatial heterogeneity. A set of machine learning models, including random forest, generalized additive model, and extreme gradient boosting, was trained in each region separately. Finally, a generalized additive ensemble model was developed to combine the predictions from the different algorithms. The ensemble prediction characterized the spatiotemporal distribution of daily PM2.5 well, with a cross-validation (CV) R² (RMSE) of 0.79 (21 μg/m³). The cluster-based subregion models outperformed national models and improved the CV R² by ∼0.05. Compared with previous studies, our model provided more accurate out-of-range predictions at the daily level (R² = 0.58, RMSE = 29 μg/m³) and the monthly level (R² = 0.76, RMSE = 16 μg/m³). Our hindcast modeling system allows for the construction of unbiased historical PM2.5 levels.
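The abstract reports R² and RMSE for an ensemble that combines predictions from several algorithms. The following is a minimal sketch of those two metrics and of a two-model blend, not the authors' implementation. It assumes a single least-squares blending weight as a linear stand-in for the generalized additive ensemble model, and the observations and predictions are made-up toy values. All function and variable names are hypothetical, and the scores are computed in-sample rather than by cross-validation.

```python
import math


def rmse(obs, pred):
    """Root-mean-square error between observed and predicted values."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def r_squared(obs, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def fit_blend_weight(obs, pred_a, pred_b):
    """Least-squares weight w for the blend w * pred_a + (1 - w) * pred_b.

    Assumption: a linear blend replaces the generalized additive ensemble
    described in the abstract, purely for illustration.
    """
    num = sum((a - b) * (o - b) for o, a, b in zip(obs, pred_a, pred_b))
    den = sum((a - b) ** 2 for a, b in zip(pred_a, pred_b))
    return num / den


def blend(w, pred_a, pred_b):
    """Combine two prediction series with weight w on the first."""
    return [w * a + (1 - w) * b for a, b in zip(pred_a, pred_b)]


# Toy daily concentrations in ug/m^3 (illustrative values, not study data).
obs = [10, 20, 30, 40]
pred_a = [14, 24, 34, 44]  # hypothetical base model that overpredicts
pred_b = [6, 16, 26, 36]   # hypothetical base model that underpredicts

w = fit_blend_weight(obs, pred_a, pred_b)
blended = blend(w, pred_a, pred_b)

print("weight on model A:", w)
print("model A: R2 =", r_squared(obs, pred_a), "RMSE =", rmse(obs, pred_a))
print("blend:   R2 =", r_squared(obs, blended), "RMSE =", rmse(obs, blended))
```

In this toy case the two base models err in opposite directions, so the fitted blend scores better than either model alone. A hindcast evaluation like the one described would instead fit the weight on one period and score it on held-out data.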