Venkatraman Jagatha Janani, Schneider Christoph, Sauter Tobias
Geography Department, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany.
Sensors (Basel). 2024 Jun 27;24(13):4193. doi: 10.3390/s24134193.
Machine learning (ML) methods are widely used in particulate matter prediction modelling, especially with air quality sensor data. Despite their advantages, the black-box nature of these methods obscures how a prediction is made. Major issues with these types of models include data quality and computational intensity. In this study, we applied feature selection, using recursive feature elimination and global sensitivity analysis, to a random-forest (RF)-based land-use regression model developed for the city of Berlin, Germany. Land-use-based predictors, including local climate zones, leaf area index, daily traffic volume, population density, building types, building heights, and street types, were used to create a baseline RF model. Five additional models, three using recursive feature elimination and two using a Sobol-based global sensitivity analysis (GSA), were implemented, and their performance was compared against that of the baseline RF model. The predictors that had a large effect on the prediction, as determined using both methods, are discussed. Through feature elimination, the number of predictors was reduced from 220 in the baseline model to eight in the parsimonious models without sacrificing model performance. A comparison of the model metrics showed that the parsimonious_GSA-based model performs better than the baseline model, reducing the mean absolute error (MAE) from 8.69 µg/m³ to 3.6 µg/m³ and the root mean squared error (RMSE) from 9.86 µg/m³ to 4.23 µg/m³ when the trained model is applied to reference station data. The better performance of the GSA_parsimonious model is made possible by removing multicollinear and redundant predictors, which curtails the uncertainties propagated through the model. Validated against reference stations, the parsimonious model predicted PM concentrations with an MAE of less than 5 µg/m³ at 10 out of 12 locations.
The GSA_parsimonious model performed best in all model metrics and improved the R² from 3% in the baseline model to 17%. However, the predictions exhibited a degree of uncertainty that makes the model unreliable for regional-scale modelling. The GSA_parsimonious model can nevertheless be adapted to local scales to highlight the land-use parameters that are indicative of PM concentrations in Berlin. Overall, population density, leaf area index, and traffic volume are the major predictors of PM, while building type and local climate zones are less significant predictors. Feature selection based on sensitivity analysis has a large impact on model performance. Optimising models through sensitivity analysis can enhance the interpretability of the model dynamics and potentially reduce computational cost and time when modelling is performed for larger areas.
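The idea behind ranking predictors with a Sobol-based GSA can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which estimator or software was used, so the code below assumes a Saltelli-type Monte Carlo estimator of first-order indices, independent inputs drawn uniformly from [0, 1], and a hypothetical `toy_model` with arbitrary coefficients standing in for the trained RF model's prediction function.

```python
import random


def toy_model(x):
    # Hypothetical stand-in for a trained model's predict(); the
    # coefficients are arbitrary and not taken from the study.
    return 4.0 * x[0] + 2.0 * x[1] + 0.0 * x[2]


def sobol_first_order(model, n_inputs, n_samples, seed=0):
    """Estimate first-order Sobol indices for independent U(0, 1) inputs."""
    rng = random.Random(seed)
    # Two independent sample matrices A and B.
    A = [[rng.random() for _ in range(n_inputs)] for _ in range(n_samples)]
    B = [[rng.random() for _ in range(n_inputs)] for _ in range(n_samples)]
    fA = [model(row) for row in A]
    fB = [model(row) for row in B]

    # Output mean and variance estimated from all evaluations of A and B.
    all_f = fA + fB
    mean = sum(all_f) / len(all_f)
    var = sum((y - mean) ** 2 for y in all_f) / (len(all_f) - 1)

    indices = []
    for i in range(n_inputs):
        total = 0.0
        for a_row, b_row, ya, yb in zip(A, B, fA, fB):
            # Row of A with only input i replaced by its value from B.
            ab_row = a_row[:i] + [b_row[i]] + a_row[i + 1:]
            # Centring f(B) by the mean reduces estimator variance.
            total += (yb - mean) * (model(ab_row) - ya)
        indices.append(total / n_samples / var)
    return indices


indices = sobol_first_order(toy_model, n_inputs=3, n_samples=50000)
# Rank inputs from most to least influential.
ranking = sorted(range(3), key=lambda i: indices[i], reverse=True)
print(indices, ranking)
```

For this toy function the analytic first-order indices are 0.8, 0.2, and 0, so the estimates should land close to those values and the third input, which has no effect on the output, would be the first candidate for removal. In a setting like the one described above, `toy_model` would be replaced by the fitted model's prediction function and the inputs sampled over the ranges of the land-use predictors.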