Ma Yanyan, Qiao Yiheng, Chen Mengxue, Rui Dongni, Zhang Xuxiang, Liu Weijing, Ye Lin
State Key Laboratory of Pollution Control and Resource Reuse, School of Environment, Nanjing University, Nanjing 210023, China.
Nanjing Gaoke Environmental Technology Co., Ltd., Nanjing 210038, China.
Water Res. 2025 Apr 15;274:123041. doi: 10.1016/j.watres.2024.123041. Epub 2024 Dec 25.
Wastewater treatment plants (WWTPs) generate vast amounts of water quality, operational, and biological data. The potential of these data to improve WWTP management, particularly when analyzed with machine learning (ML), is increasingly recognized. However, the costs of data collection and processing can rise sharply as datasets grow, and research on the optimal data volume for effective ML application remains limited. In this study, we comprehensively analyzed water quality, operational, and biological data collected from a full-scale WWTP over 970 days. Our results demonstrate that ML models can predict not only operational and water quality parameters (dissolved oxygen concentration and effluent chemical oxygen demand) but also the abundances of functional bacteria. Notably, we found that increasing data volume does not always improve model performance, and that data collection intervals do not need to be excessively small, as moderate intervals can still yield reliable predictions. These findings suggest that excessively large datasets may not be necessary for effective ML predictions in WWTPs. Overall, this study underscores the importance of optimizing dataset size to balance computational efficiency and prediction accuracy, providing insights into data management strategies that can enhance the operational efficiency and sustainability of WWTPs.
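The data-volume question raised in the abstract can be illustrated with a learning-curve experiment: train on progressively longer histories, optionally thinned to a coarser sampling interval, and score each model on the same hold-out period. The sketch below is a minimal illustration under stated assumptions, not the authors' method. The series is synthetic (only the 970-day length comes from the abstract), a one-variable least-squares line stands in for the ML models used in the study, and every name (`make_synthetic_series`, `fit_line`, `learning_curve`) is hypothetical.

```python
import math
import random


def make_synthetic_series(n_days=970, seed=0):
    """Synthetic stand-in for plant data: one driver x and one target y per day.

    Assumption: these values are invented for illustration and are not
    measurements from the study.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for day in range(n_days):
        x = 2.0 + math.sin(2 * math.pi * day / 365) + rng.gauss(0, 0.3)
        y = 1.5 * x + 4.0 + rng.gauss(0, 0.5)
        xs.append(x)
        ys.append(y)
    return xs, ys


def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = a * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def rmse(a, b, xs, ys):
    """Root-mean-square error of the fitted line on (xs, ys)."""
    return math.sqrt(sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


def learning_curve(xs, ys, test_days=170,
                   train_sizes=(50, 100, 200, 400, 800), interval=1):
    """Score models trained on the last `size` days before the hold-out.

    `interval` keeps every interval-th day of the training window, which
    mimics a coarser data collection schedule. The hold-out is always the
    final `test_days` days, so scores are comparable across settings.
    Returns a list of (window_days, samples_used, holdout_rmse) tuples.
    """
    split = len(xs) - test_days
    x_test, y_test = xs[split:], ys[split:]
    results = []
    for size in train_sizes:
        x_train = xs[split - size:split:interval]
        y_train = ys[split - size:split:interval]
        a, b = fit_line(x_train, y_train)
        results.append((size, len(x_train), rmse(a, b, x_test, y_test)))
    return results


if __name__ == "__main__":
    xs, ys = make_synthetic_series()
    for interval in (1, 7):
        print(f"sampling interval: every {interval} day(s)")
        for size, n_used, err in learning_curve(xs, ys, interval=interval):
            print(f"  window={size:4d} d  samples={n_used:4d}  RMSE={err:.3f}")
```

Where the hold-out error stops improving as the window grows or the interval shrinks, additional data adds collection and processing cost without a matching gain in accuracy. The hold-out is kept chronological rather than shuffled so that no future observations leak into training.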