State Key Laboratory of Oil and Gas Reservoir Geology and Exploitation, Southwest Petroleum University, Chengdu, 610500, China; Trenchless Technology Center, Louisiana Tech University, Ruston, LA, 71270, United States.
School of Science, Southwest University of Science and Technology, Mianyang, 621010, China.
Chemosphere. 2020 Jun;249:126169. doi: 10.1016/j.chemosphere.2020.126169. Epub 2020 Feb 11.
Water resources underpin daily life and economic development and are closely tied to health and the environment. Accurate prediction of water quality is key to improving water management and pollution control. This paper proposes two novel hybrid decision tree-based machine learning models to obtain more accurate short-term water quality predictions. The base models of the two hybrids are extreme gradient boosting (XGBoost) and random forest (RF), each combined with an advanced data denoising technique, complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN). Taking the Gales Creek site in the basin of the Tualatin River (one of the most polluted rivers in the world) as an example, a total of 1875 hourly observations from May 1, 2019 to July 20, 2019 were collected. The two hybrid models are used to predict six water quality indicators: water temperature, dissolved oxygen, pH value, specific conductance, turbidity, and fluorescent dissolved organic matter. Six error metrics serve as the basis for performance evaluation, and the results of the two models are compared with those of four conventional models. The results reveal that: (1) CEEMDAN-RF performs best in predicting temperature, dissolved oxygen, and specific conductance, with mean absolute percentage errors (MAPEs) of 0.69%, 1.05%, and 0.90%, respectively. CEEMDAN-XGBoost performs best in predicting pH value, turbidity, and fluorescent dissolved organic matter, with MAPEs of 0.27%, 14.94%, and 1.59%, respectively. (2) The average MAPEs of the CEEMDAN-RF and CEEMDAN-XGBoost models are the smallest, at 3.90% and 3.71% respectively, indicating that their overall prediction performance is the best. The stability of the prediction models is also discussed. The analysis shows that the prediction stability of CEEMDAN-RF and CEEMDAN-XGBoost is higher than that of the other benchmark models.
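The headline results above are reported as MAPEs. As a minimal sketch, assuming the standard definition of the metric (the abstract does not state the exact formula the authors used), MAPE can be computed as follows. The function name `mape` and the sample values are hypothetical illustrations, not data from the study.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, returned in percent.

    Assumes the standard definition: the mean of |(a - p) / a| over all
    paired observations, multiplied by 100.
    """
    if len(actual) != len(predicted):
        raise ValueError("series must have the same length")
    if not actual:
        raise ValueError("series must be non-empty")
    if any(a == 0 for a in actual):
        # The ratio is undefined when an observed value is zero.
        raise ValueError("MAPE is undefined when an observed value is zero")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical observed and forecast values, not taken from the paper.
observed = [10.0, 20.0, 25.0, 40.0]
forecast = [11.0, 19.0, 25.0, 38.0]
score = mape(observed, forecast)  # about 5 percent for these values
```

Because each error is divided by the observed value, MAPE is scale-free. This makes it possible to average it across indicators with different units, as the abstract does, but it also inflates the score for quantities whose observed values are close to zero.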