Gholizadeh Mahdi, Saeedi Reza, Bagheri Amin, Paeezi Mohammad
Environmental and Occupational Hazards Control Research Center, Shahid Beheshti University of Medical Sciences, Tehran, Iran; Department of Health, Safety and Environment, School of Public Health and Safety, Shahid Beheshti University of Medical Sciences, Tehran, Iran.
Department of Health, Safety and Environment, School of Public Health and Safety, Shahid Beheshti University of Medical Sciences, Tehran, Iran; Workplace Health Promotion Research Center, Shahid Beheshti University of Medical Sciences, Tehran, Iran.
Environ Res. 2024 Apr 1;246:118146. doi: 10.1016/j.envres.2024.118146. Epub 2024 Jan 11.
Accurately predicting the characteristics of effluent discharged from wastewater treatment plants (WWTPs) is crucial for reducing sampling requirements, labor, costs, and environmental pollution. Machine learning (ML) techniques can be effective in achieving this goal, and various feature selection (FS) methods are used to optimize ML-based models. This study investigates the impact of six FS methods (categorized as Wrapper, Filter, and Embedded methods) on the accuracy of three supervised ML algorithms in predicting total suspended solids (TSS) concentration in the effluent of a municipal WWTP. Based on the features proposed by each FS method, five distinct scenarios were defined. Within each scenario, three ML algorithms were applied: artificial neural network-multilayer perceptron (ANN-MLP), K-nearest neighbors (KNN), and adaptive boosting (AdaBoost). The candidate features for predicting effluent TSS concentration were biochemical oxygen demand (BOD), chemical oxygen demand (COD), TSS, total nitrogen (TN), and ammonia nitrogen in the influent, and BOD, COD, residual chlorine, nitrate nitrogen, TN, and ammonia nitrogen in the effluent. To construct the models, the dataset was randomly divided into training and testing subsets, and K-fold cross-validation was employed to control overfitting and underfitting. The evaluation metrics were root mean squared error (RMSE), mean absolute error (MAE), and correlation coefficient (R). The most efficient scenario was Scenario IV, which used the Sequential Backward Selection FS method. The features selected by this method were COD, BOD, BOD, TN. Furthermore, the ANN-MLP algorithm demonstrated the best performance, achieving the highest R value, and performed acceptably on both the training and testing subsets (R = 0.78 and R = 0.8, respectively).
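The workflow summarized in the abstract (sequential backward selection, a random train/test split, K-fold cross-validation, and comparison of three regressors by RMSE, MAE, and R) can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation: the data are synthetic, the column names in FEATURES are hypothetical placeholders for the eleven candidate features, and every hyperparameter (number of folds, neighbors, hidden units, estimators) is an assumption. In this sketch the K-fold splitter is used inside the feature selector, which may differ from how the study applied it.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import KFold, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical names standing in for the 11 candidate features
# (influent and effluent measurements); not taken from the paper.
FEATURES = ["BOD_in", "COD_in", "TSS_in", "TN_in", "NH_in",
            "BOD_out", "COD_out", "Cl_out", "NO_out", "TN_out", "NH_out"]


def make_synthetic_data(n_samples=200, seed=0):
    """Generate illustrative data; the target depends on four columns plus noise."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, len(FEATURES)))
    y = (2.0 * X[:, 1] + 1.5 * X[:, 0] + 1.0 * X[:, 5] + 0.5 * X[:, 9]
         + rng.normal(scale=0.5, size=n_samples))
    return X, y


def run_sketch(seed=0):
    X, y = make_synthetic_data(seed=seed)
    # Random split into training and testing subsets (80/20 is an assumption).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)

    # Sequential backward selection: start from all features and drop one
    # at a time, scoring each candidate subset with 5-fold cross-validation.
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    selector = SequentialFeatureSelector(
        KNeighborsRegressor(n_neighbors=5),
        n_features_to_select=4,
        direction="backward",
        cv=cv,
    )
    selector.fit(X_train, y_train)
    mask = selector.get_support()
    selected = [name for name, keep in zip(FEATURES, mask) if keep]

    # The three algorithm families named in the abstract, with assumed settings.
    models = {
        "ANN-MLP": MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                                random_state=seed),
        "KNN": KNeighborsRegressor(n_neighbors=5),
        "AdaBoost": AdaBoostRegressor(n_estimators=50, random_state=seed),
    }

    results = {}
    for name, model in models.items():
        pipe = make_pipeline(StandardScaler(), model)
        pipe.fit(X_train[:, mask], y_train)
        pred = pipe.predict(X_test[:, mask])
        results[name] = {
            "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
            "MAE": float(mean_absolute_error(y_test, pred)),
            "R": float(np.corrcoef(y_test, pred)[0, 1]),
        }
    return selected, results


if __name__ == "__main__":
    selected, results = run_sketch()
    print("Selected features:", selected)
    for name, metrics in results.items():
        print(name, metrics)
```

Because the data here are synthetic, the selected features and metric values say nothing about the study's results; the sketch only shows how the pieces fit together.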