Abdelmalek Mostafa M, Mahmoud Hatem, Shokry Hassan
Environmental Engineering Department, Egypt-Japan University of Science and Technology, Alexandria, 21934, Egypt.
Mining and Metallurgical Engineering Department, Faculty of Engineering, Assiut University, Assiut, 71516, Egypt.
Sci Rep. 2025 Jul 17;15(1):25890. doi: 10.1038/s41598-025-11260-y.
Air pollution poses a significant challenge to both public health and environmental sustainability. Pollutants such as PM₂.₅, O₃, NO₂, SO₂, and CO cause serious health problems and ecological damage. This study develops and compares five machine learning (ML) models for predicting the Air Quality Index (AQI): Gaussian Process Regression (GPR), Ensemble Regression (ER), Support Vector Machine (SVM), Regression Tree (RT), and Kernel Approximation Regression (KAR). The publicly available historical air pollution dataset, collected from 1 January to 31 December 2022, was obtained from the online source titled 'A Real-time Dataset of Air Pollution Monitoring Generated Using IoT-Mendeley Data', developed by the Department of Software Engineering, Daffodil International University. Although the dataset includes six pollutants (PM₂.₅, PM₁₀, NO₂, SO₂, CO, and O₃), only three (PM₂.₅, PM₁₀, and CO) were selected for AQI prediction, based on their higher feature importance as determined using the Random Forest technique. To reduce the time and cost of measuring and analyzing pollutants, the five ML models were used to predict the AQI from these three features alone. The GPR, ER, SVM, and RT models achieved AQI prediction accuracy above 96%, whereas the KAR model was less accurate, at 82.36%. In the comparative analysis, the GPR model outperformed the other models, with the lowest Root Mean Square Error (RMSE): 0.87 in training and 1.219 in testing. The findings highlight the value of ML in enhancing air quality prediction and monitoring, offering accurate tools for hourly data analysis and potential real-time application. Such tools can assist in devising more efficient air pollution control strategies, contributing to improved public health and environmental sustainability.
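The two-step workflow described in the abstract (rank features with a Random Forest, then fit a regressor on the top three and report RMSE) can be sketched as follows. This is only an illustration in scikit-learn, not the authors' pipeline: the data are synthetic stand-ins rather than the Mendeley dataset, the column names, kernel, and train/test split are assumptions, and only GPR is shown out of the five models.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical column names for the six pollutants (assumed, not from the dataset).
names = ["PM2.5", "PM10", "NO2", "SO2", "CO", "O3"]

# Synthetic stand-in data: the target is built to depend only on
# columns 0, 1, and 4, plus a small amount of Gaussian noise.
X = rng.uniform(0.0, 1.0, size=(300, len(names)))
y = 50 * X[:, 0] + 40 * X[:, 1] + 30 * X[:, 4] + rng.normal(0.0, 1.0, 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 1: rank features by Random Forest importance and keep the top three.
forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
top3 = np.argsort(forest.feature_importances_)[::-1][:3]
selected = sorted(names[i] for i in top3)

# Step 2: fit a Gaussian Process Regression model on the selected features.
scaler = StandardScaler().fit(X_train[:, top3])
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
gpr.fit(scaler.transform(X_train[:, top3]), y_train)


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


rmse_train = rmse(y_train, gpr.predict(scaler.transform(X_train[:, top3])))
rmse_test = rmse(y_test, gpr.predict(scaler.transform(X_test[:, top3])))
print("selected features:", selected)
print("RMSE (train, test):", round(rmse_train, 3), round(rmse_test, 3))
```

The RMSE values printed here describe the synthetic data only and are not comparable to the figures reported in the study. Swapping `GaussianProcessRegressor` for another scikit-learn regressor would give a rough analogue of the model comparison.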