School of Environmental Science and Engineering, Xiamen University of Technology, Xiamen 361024, China; Drinking Water Science and Technology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China; Key Laboratory of Water Resources Utilization and Protection, Xiamen city, Xiamen 361005, China.
Department of Civil and Environmental Engineering, South Dakota School of Mines and Technology, Rapid City, SD 57701, USA.
Sci Total Environ. 2024 Nov 15;951:175573. doi: 10.1016/j.scitotenv.2024.175573. Epub 2024 Aug 15.
Determining the occurrence of disinfection byproducts (DBPs) in drinking water distribution systems (DWDSs) remains challenging. Predicting DBPs from readily available water quality parameters can help to understand DBP-associated risks and capture the complex interrelationships between water quality and DBP occurrence. In this study, we collected drinking water samples from a distribution network over the course of a year and measured haloacetic acids (HAAs) and related water quality parameters (WQPs). Twelve machine learning (ML) algorithms were evaluated. Random Forest (RF) achieved the best performance (i.e., R of 0.78 and RMSE of 7.74) for predicting HAA concentration. Instead of using cytotoxicity or genotoxicity separately as the surrogate for HAA-associated toxicity, we created a health risk index (HRI), calculated as the sum of the cytotoxicity and genotoxicity of HAAs following the widely used TIC-Tox approach. ML models were likewise developed to predict the HRI, and the RF model again performed best, obtaining R of 0.69 and RMSE of 0.38. To further explore advanced ML approaches, we developed three models using uncertainty-based active learning. The Categorical Boosting Regression (CAT) model developed through active learning substantially outperformed the other models, achieving R of 0.87 and 0.82 for predicting HAA concentration and the HRI, respectively. Feature importance analysis with the CAT model revealed that temperature, ions (e.g., chloride and nitrate), and DOC concentration in the distribution network had a significant impact on the occurrence of HAAs. Meanwhile, chloride ion, pH, ORP, and free chlorine were found to be the most important features for HRI prediction. This study demonstrates that ML has potential for predicting HAA occurrence and toxicity. By identifying the key WQPs that affect HAA occurrence and toxicity, this research offers valuable insights for targeted DBP mitigation strategies.
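The HRI described above is a sum of a cytotoxicity term and a genotoxicity term over the measured HAAs. The following is a minimal sketch of that arithmetic, not the authors' implementation. It assumes each term is a molar concentration divided by a potency value, which is one common reading of concentration-weighted toxicity indices; any scaling factor is omitted. The names `Potency`, `PLACEHOLDER_POTENCY`, and `health_risk_index`, the species labels, and all numeric values are hypothetical placeholders and are not measured potencies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Potency:
    """Hypothetical toxic potency values for one HAA species, in mol/L."""
    cyto: float  # assumed cytotoxic potency (a lower value means more toxic)
    geno: float  # assumed genotoxic potency (a lower value means more toxic)


# Placeholder numbers for illustration only; NOT published potency data.
PLACEHOLDER_POTENCY = {
    "HAA_A": Potency(cyto=1e-3, geno=2e-3),
    "HAA_B": Potency(cyto=1e-5, geno=5e-5),
}


def health_risk_index(conc_molar, potency):
    """Sum concentration/potency ratios for cytotoxicity and genotoxicity.

    conc_molar maps species name -> molar concentration in the sample.
    """
    cyto_term = sum(c / potency[name].cyto for name, c in conc_molar.items())
    geno_term = sum(c / potency[name].geno for name, c in conc_molar.items())
    return cyto_term + geno_term


# One hypothetical sample with two HAA species.
sample = {"HAA_A": 1e-7, "HAA_B": 1e-8}
hri = health_risk_index(sample, PLACEHOLDER_POTENCY)
print(f"HRI = {hri:.3e}")
```

Under this assumed form, a species with a small potency value dominates the index even at a low concentration, which is why a toxicity-weighted target can rank water samples differently from total HAA concentration.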