Swiss Tropical and Public Health Institute, Allschwil, Switzerland; University of Basel, Basel, Switzerland.
Federal Office of Meteorology and Climatology MeteoSwiss, Switzerland.
Environ Res. 2024 Dec 15;263(Pt 1):119999. doi: 10.1016/j.envres.2024.119999. Epub 2024 Sep 20.
Statistical and machine learning models are commonly used to estimate spatial and temporal variability in exposure to environmental stressors in support of epidemiological studies. We aimed to compare the performance, strengths and limitations of six algorithms for the retrospective spatiotemporal modeling of daily birch and grass pollen concentrations at a spatial resolution of 1 km across Switzerland.
Daily birch and grass pollen concentrations were available from 14 measurement sites in Switzerland for 2000-2019. To develop the spatiotemporal models, we considered spatiotemporal, spatial and temporal predictors, including meteorological factors, land use, elevation, species distribution and the Normalized Difference Vegetation Index (NDVI). We used six statistical and machine learning algorithms: LASSO, Ridge, Elastic net, Random forest, XGBoost and artificial neural networks (ANNs). We optimized model structures through feature selection and grid search to obtain the best predictive performance. We used a train-test split and cross-validation to avoid overfitting and overoptimistic performance indicators. We then combined the six models through multiple linear regression to develop an ensemble hybrid model.
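The final step, combining base-model predictions through multiple linear regression, can be illustrated with a minimal sketch. This is not the authors' code: the data are synthetic, the function names (`fit_stacking_weights`, `predict_stack`, `rmse`) are hypothetical, and three noisy stand-ins replace the six fitted models. It assumes only NumPy and fits an intercept plus one weight per base model by ordinary least squares.

```python
import numpy as np


def fit_stacking_weights(base_preds, y):
    """Fit an intercept and one weight per base model by ordinary least squares."""
    X = np.column_stack([np.ones(len(y)), base_preds])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def predict_stack(coef, base_preds):
    """Apply the fitted intercept and weights to base-model predictions."""
    X = np.column_stack([np.ones(base_preds.shape[0]), base_preds])
    return X @ coef


def rmse(y_true, y_pred):
    """Root mean squared error between observations and predictions."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))


# Synthetic, right-skewed "observed" concentrations (illustrative only).
rng = np.random.default_rng(0)
y = rng.gamma(shape=1.0, scale=30.0, size=400)

# Three hypothetical base models: the truth plus independent noise of varying size.
base = np.column_stack(
    [y + rng.normal(0.0, sd, size=y.size) for sd in (10.0, 15.0, 20.0)]
)

# Simple train-test split by row position.
train, test = slice(0, 300), slice(300, 400)

coef = fit_stacking_weights(base[train], y[train])

train_ensemble_rmse = rmse(y[train], predict_stack(coef, base[train]))
train_base_rmse = [rmse(y[train], base[train, j]) for j in range(base.shape[1])]
test_ensemble_rmse = rmse(y[test], predict_stack(coef, base[test]))
test_base_rmse = [rmse(y[test], base[test, j]) for j in range(base.shape[1])]

print("ensemble weights (intercept first):", np.round(coef, 3))
print("test RMSE, ensemble:", round(test_ensemble_rmse, 2))
print("test RMSE, base models:", [round(v, 2) for v in test_base_rmse])
```

On the training rows the least-squares ensemble can never do worse than any single base model, because each base model corresponds to one particular choice of weights. Whether it also wins on held-out data has to be checked empirically. In practice the weights are usually fitted on out-of-fold predictions, so that the base models' overfitting does not leak into the ensemble.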
The 5th-95th percentiles of birch and grass pollen concentrations were 0-151 and 0-105 grains/m³, respectively. The hybrid ensemble model achieved the best RMSE on the test dataset for both birch and grass pollen, with 94.4 and 19.7 grains/m³, respectively. Nonlinear models (Random forest, XGBoost and ANNs) achieved lower test RMSEs than linear models (LASSO, Ridge, Elastic net) for both pollen types, with RMSEs ranging from 105.9 to 140.5 grains/m³ for birch and from 20.0 to 25.4 grains/m³ for grass pollen. The Random forest algorithm yielded the best spatial and temporal performance among the six evaluated modeling methods. The ensemble hybrid model outperformed the six linear and nonlinear algorithms. Country-wide pollen concentration, land use, weather, and NDVI were important predictors.
Nonlinear algorithms outperformed linear models and accurately explained complex, nonlinear relationships between environmental factors and measured concentrations.