Department of Engineering Physics, Tsinghua University, Beijing 100084, China; Beijing Key Laboratory of City Integrated Emergency Response Science, Tsinghua University, Beijing 100084, China.
Sci Total Environ. 2018 Sep 1;635:644-658. doi: 10.1016/j.scitotenv.2018.04.040. Epub 2018 Apr 24.
A stacked ensemble model is developed for forecasting and analyzing the daily average concentrations of fine particulate matter (PM2.5) in Beijing, China. Before modeling, special feature extraction procedures, including simplification, polynomial, transformation and combination procedures, are applied to identify potentially significant features based on an exploratory data analysis. Stability feature selection and tree-based feature selection methods are then used to select important variables and to evaluate feature importance. Single models, namely LASSO, AdaBoost, XGBoost and a multi-layer perceptron optimized by a genetic algorithm (GA-MLP), are built in the level 0 space and then integrated by support vector regression (SVR) in the level 1 space via stacked generalization. The feature importance analysis shows that nitrogen dioxide (NO2) and carbon monoxide (CO) concentrations measured in the city of Zhangjiakou are the most important pollution factors for forecasting PM2.5 concentrations. Among the meteorological factors, local extreme wind speeds and maximal wind speeds have the greatest effect on the cross-regional transport of contaminants. Pollutants from the cities of Zhangjiakou and Chengde have a stronger impact on air quality in Beijing than other surrounding factors. The model evaluation shows that the ensemble model generally performs better than a single nonlinear forecasting model when applied to new data, with a coefficient of determination (R²) of 0.90 and a root mean squared error (RMSE) of 23.69 μg/m³. For single pollutant grade recognition, the proposed model performs better on days with good air quality than on days with high levels of pollution. The overall classification accuracy is 73.93%, with most misclassifications occurring between adjacent categories.
The results demonstrate the interpretability and generalizability of the stacked ensemble model.
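The two-level structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn is available, uses `GradientBoostingRegressor` as a stand-in for XGBoost, trains the multi-layer perceptron without a genetic algorithm, and runs on synthetic data generated inside the hypothetical helper `make_synthetic_data`. All hyperparameters are illustrative guesses.

```python
import numpy as np
from sklearn.ensemble import (AdaBoostRegressor, GradientBoostingRegressor,
                              StackingRegressor)
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def make_synthetic_data(n_samples=400, seed=0):
    """Hypothetical data: six random features and a mildly nonlinear target."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, 6))
    y = (50.0 + 20.0 * X[:, 0] - 10.0 * X[:, 1]
         + 5.0 * X[:, 2] * X[:, 3] + rng.normal(scale=5.0, size=n_samples))
    return X, y


def build_stack(seed=0):
    """Level 0: four base regressors. Level 1: SVR on their out-of-fold predictions."""
    level0 = [
        ("lasso", make_pipeline(StandardScaler(), Lasso(alpha=0.1))),
        ("adaboost", AdaBoostRegressor(n_estimators=50, random_state=seed)),
        # Assumption: gradient boosting from scikit-learn replaces XGBoost here.
        ("gbr", GradientBoostingRegressor(random_state=seed)),
        # Assumption: a plain MLP, with no genetic-algorithm tuning.
        ("mlp", make_pipeline(
            StandardScaler(),
            MLPRegressor(hidden_layer_sizes=(16,), solver="lbfgs",
                         max_iter=1000, random_state=seed))),
    ]
    level1 = make_pipeline(StandardScaler(), SVR(kernel="linear", C=10.0))
    # cv=5 means the level 1 model is trained on 5-fold out-of-fold predictions.
    return StackingRegressor(estimators=level0, final_estimator=level1, cv=5)


def run_demo(seed=0):
    """Fit the stack and return (R^2, RMSE) on a held-out split."""
    X, y = make_synthetic_data(seed=seed)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    model = build_stack(seed).fit(X_train, y_train)
    pred = model.predict(X_test)
    r2 = r2_score(y_test, pred)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    return r2, rmse


if __name__ == "__main__":
    r2, rmse = run_demo()
    print(f"R^2 = {r2:.3f}, RMSE = {rmse:.2f}")
```

Training the level 1 learner on out-of-fold predictions rather than in-sample fits is what keeps stacked generalization from simply rewarding the base model that overfits most. The scores this sketch prints relate only to the synthetic data and say nothing about the reported R² of 0.90 or RMSE of 23.69 μg/m³.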
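The grade recognition step can be sketched in the same hedged way: forecast and observed concentrations are binned into air quality grades, then overall accuracy and the share of errors that fall in an adjacent grade are computed. The abstract does not give the grade boundaries, so the breakpoints below (35, 75, 115, 150 and 250 μg/m³ for 24-hour PM2.5) and the grade names are assumptions, as are the function names `to_grade` and `grade_report`.

```python
import numpy as np

# Assumed 24-hour PM2.5 upper limits (ug/m^3) for the first five grades;
# values above the last limit fall into the sixth grade.
BREAKPOINTS = np.array([35.0, 75.0, 115.0, 150.0, 250.0])
GRADES = ["excellent", "good", "lightly polluted",
          "moderately polluted", "heavily polluted", "severely polluted"]


def to_grade(concentration):
    """Map concentrations to grade indices 0..5 (a limit value stays in the lower grade)."""
    values = np.asarray(concentration, dtype=float)
    return np.searchsorted(BREAKPOINTS, values, side="left")


def grade_report(observed, predicted):
    """Return (accuracy, share of misclassifications that are off by one grade)."""
    obs = to_grade(observed)
    pred = to_grade(predicted)
    wrong = obs != pred
    accuracy = float(np.mean(~wrong))
    if not wrong.any():
        return accuracy, 0.0
    adjacent = np.abs(obs - pred)[wrong] == 1
    return accuracy, float(np.mean(adjacent))


if __name__ == "__main__":
    observed = [20, 60, 100, 140, 200, 300]
    predicted = [30, 80, 90, 160, 120, 260]
    accuracy, adjacent_share = grade_report(observed, predicted)
    print(accuracy, adjacent_share)  # prints 0.5 1.0
```

Reporting the adjacent-grade share alongside accuracy separates near misses at a grade boundary from forecasts that are wrong by two or more grades.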