Application of the Lasso regularisation technique in mitigating overfitting in air quality prediction models.

Author information

Pak Abbas, Rad Abdullah Kaviani, Nematollahi Mohammad Javad, Mahmoudi Mohammadreza

Affiliations

Department of Computer Sciences, Shahrekord University, Shahrekord, Iran.

Department of Environmental Engineering and Natural Resources, College of Agriculture, Shiraz University, Shiraz, 71946-85111, Iran.

Publication information

Sci Rep. 2025 Jan 2;15(1):547. doi: 10.1038/s41598-024-84342-y.

DOI:10.1038/s41598-024-84342-y

PMID:39747344

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11696743/

Abstract

As a significant global concern, air pollution poses enormous challenges to public health and ecological sustainability, necessitating precise algorithms to forecast and mitigate its impacts. This need has led to the development of many machine learning (ML)-based models for predicting air quality. However, overfitting is a prevalent issue with ML algorithms that decreases their efficacy and generalizability. The present investigation, using an extensive collection of data from 16 sensors in Tehran, Iran, from 2013 to 2023, focuses on applying the Least Absolute Shrinkage and Selection Operator (Lasso) regularisation technique to enhance the forecasting precision of models for ambient air pollutant concentrations, including particulate matter (PM2.5 and PM10), CO, NO2, SO2, and O3, while decreasing overfitting. The outputs were compared using the R-squared (R²), mean absolute error (MAE), mean square error (MSE), root mean square error (RMSE), and normalised mean square error (NMSE) indices. Although the preliminary findings revealed that Lasso dramatically enhances model reliability by decreasing overfitting and identifying key attributes, the model's performance in predicting gaseous pollutants remained unsatisfactory compared with PM (R² = 0.80 for PM2.5, 0.75 for PM10, 0.45 for CO, 0.55 for NO2, 0.65 for SO2, and 0.35 for O3). The minimal degree of missing data presumably explains the strong performance of the PM models, while the high dynamism of gases and their chemical interactions, in conjunction with the inherent characteristics of the model, were the primary factors contributing to the poor performance for gaseous pollutants. Nevertheless, the successful implementation of the Lasso regularisation approach in mitigating overfitting and selecting the more important features makes it highly recommended for air quality forecasting models.
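To make the abstract's two technical ingredients concrete, the following is a minimal sketch of Lasso regression and of the five evaluation indices it names. It is not the authors' implementation: the data are synthetic, the function names (`lasso_coordinate_descent`, `regression_metrics`), the penalty value `lam=0.2`, and the cyclic coordinate descent solver are all assumptions chosen for illustration. The Lasso objective is assumed to be (1/2n)·||y - b0 - Xw||² + lam·||w||₁, and NMSE is assumed to be MSE divided by the variance of the observations, which is one common convention among several.

```python
import math
import random


def soft_threshold(rho, lam):
    """Soft-thresholding operator: shrinks rho towards zero by lam."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/2n)*||y - b0 - Xw||^2 + lam*||w||_1 (assumed objective)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b0 = sum(y) / n
    for _ in range(n_iter):
        for j in range(p):
            rho, z = 0.0, 0.0
            for i in range(n):
                # Prediction that leaves out feature j's current contribution.
                pred_wo_j = b0 + sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_wo_j)
                z += X[i][j] ** 2
            # Coefficients whose correlation with the residual is below lam
            # are set exactly to zero; this is the feature-selection effect.
            w[j] = soft_threshold(rho / n, lam) / (z / n) if z > 0 else 0.0
        # The intercept is not penalised.
        b0 = sum(y[i] - sum(X[i][k] * w[k] for k in range(p)) for i in range(n)) / n
    return b0, w


def regression_metrics(y_true, y_pred):
    """Return R2, MAE, MSE, RMSE and NMSE for paired observations and predictions."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - q) ** 2 for t, q in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    mse = ss_res / n
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MAE": sum(abs(t - q) for t, q in zip(y_true, y_pred)) / n,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "NMSE": mse / (ss_tot / n),  # assumed: normalised by observation variance
    }


# Synthetic stand-in for sensor data: only the first two features carry signal.
random.seed(0)
n, p = 200, 5
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [3.0 * row[0] - 2.0 * row[1] + random.gauss(0, 0.5) for row in X]

# Hold out the last 50 rows so the metrics reflect unseen data.
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

b0, w = lasso_coordinate_descent(X_train, y_train, lam=0.2)
y_hat = [b0 + sum(xj * wj for xj, wj in zip(row, w)) for row in X_test]
metrics = regression_metrics(y_test, y_hat)

print("coefficients:", [round(c, 3) for c in w])
print({name: round(value, 3) for name, value in metrics.items()})
```

In this sketch the L1 penalty drives the coefficients of the uninformative features to exactly zero, which is the mechanism behind both the overfitting reduction and the feature selection described in the abstract. Under the NMSE convention assumed here, NMSE equals 1 - R², so a paper reporting both as independent indices may be using a different normalisation.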
