Nong Xizhi, Lai Cheng, Chen Lihua, Wei Jiahua
College of Civil Engineering and Architecture, Guangxi University, Nanning 530004, China; State Key Laboratory of Hydroscience and Engineering, Tsinghua University, Beijing 100084, China; Centre for Urban Sustainability and Resilience, Department of Civil, Environmental and Geomatic Engineering, University College London, London WC1E 6BT, UK; School of Computing and Engineering, University of West London, London W5 5RF, UK.
College of Civil Engineering and Architecture, Guangxi University, Nanning 530004, China.
Sci Total Environ. 2024 Nov 10;950:175281. doi: 10.1016/j.scitotenv.2024.175281. Epub 2024 Aug 6.
Machine learning models (MLMs) are increasingly used to forecast water pollution. However, their "black box" nature, which obscures the underlying mechanisms, still limits their applicability to water quality management in hydro-projects subject to complex and frequent artificial regulation. This study proposes an interpretable machine learning framework for water quality prediction that couples a hydrodynamic (flow discharge) scenario-based Random Forest (RF) model with multiple model-agnostic techniques and quantifies global, local, and joint interpretations (i.e., partial dependence, individual conditional expectation, and accumulated local effects) of the effects of environmental factors. The framework was applied and verified by predicting the permanganate index (CODMn) under different flow discharge regulation scenarios in the Middle Route of the South-to-North Water Diversion Project of China (MRSNWDPC). A total of 4664 sampling cases (data matrices), comprising water quality, meteorological, and hydrological indicators from eight national stations along the main canal of the MRSNWDPC, were collected from May 2019 to December 2020. The results showed that the RF models were effective in forecasting CODMn in all flow discharge scenarios, with a mean squared error, coefficient of determination, and mean absolute error of 0.006-0.026, 0.481-0.792, and 0.069-0.104, respectively, on the testing dataset. The global interpretation indicated that dissolved oxygen, flow discharge, and surface pressure are the three most important variables for predicting CODMn. The local and joint interpretations indicated that the RF-based prediction model captures a basic understanding of the physical mechanisms of environmental systems.
The proposed framework can effectively learn how environmental factors drive water quality variations while providing reliable prediction performance, which highlights the importance of model interpretability for trustworthy machine learning applications in water management projects. This study provides a scientific reference for applying advanced data-driven MLMs to water quality forecasting and a reliable methodological framework for water quality management in similar hydro-projects.
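To make the workflow described in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes scikit-learn and NumPy, and it uses synthetic data with an invented response function in place of the MRSNWDPC measurements. The feature names, value ranges, and hyperparameters are illustrative assumptions. It fits a Random Forest regressor, reports the three error metrics named in the abstract, ranks features with permutation importance as one possible global interpretation, and computes partial dependence (PDP) together with individual conditional expectation (ICE) curves for one feature. Accumulated local effects are omitted because scikit-learn does not provide them.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import partial_dependence, permutation_importance
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the study's monitoring data).
rng = np.random.default_rng(0)
n = 600
feature_names = ["dissolved_oxygen", "flow_discharge", "surface_pressure"]
X = np.column_stack([
    rng.uniform(6.0, 12.0, n),       # hypothetical dissolved oxygen range
    rng.uniform(100.0, 350.0, n),    # hypothetical flow discharge range
    rng.uniform(990.0, 1030.0, n),   # hypothetical surface pressure range
])
# Invented response used only for illustration: the target depends on the
# first two features and ignores the third.
y = 2.5 - 0.08 * X[:, 0] + 0.002 * X[:, 1] + rng.normal(0.0, 0.05, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the Random Forest (hyperparameters are arbitrary choices).
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Test-set metrics: mean squared error, R^2, mean absolute error.
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"MSE={mse:.4f}  R2={r2:.3f}  MAE={mae:.4f}")

# Global interpretation: permutation importance on the test set,
# sorted from most to least important.
perm = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
order = np.argsort(perm.importances_mean)[::-1]
ranking = [feature_names[i] for i in order]
print("importance ranking:", ranking)

# Local and averaged effects for the first feature:
# kind="both" returns the PDP ("average") and the ICE curves ("individual").
pd_result = partial_dependence(
    model, X_test, features=[0], kind="both", grid_resolution=20
)
pdp = pd_result["average"][0]      # shape: (n_grid_points,)
ice = pd_result["individual"][0]   # shape: (n_test_samples, n_grid_points)
print("PDP points:", pdp.shape[0], " ICE curves:", ice.shape[0])
```

In this sketch the PDP is the pointwise mean of the ICE curves, so plotting both shows whether the averaged effect hides differences between individual samples. Fitting one such model per flow discharge scenario would be one way to mirror the scenario-based setup the abstract describes.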