Uddin Md Galal, Rahman Azizur, Rosa Taghikhah Firouzeh, Olbert Agnieszka I
School of Engineering, University of Galway, Ireland; Ryan Institute, University of Galway, Ireland; MaREI Research Centre, University of Galway, Ireland; Eco-HydroInformatics Research Group (EHIRG), Civil Engineering, National University of Ireland Galway, Ireland.
School of Computing, Mathematics and Engineering, Charles Sturt University, Wagga, Australia; The Gulbali Institute of Agriculture, Water and Environment, Charles Sturt University, Wagga, Australia.
Water Res. 2024 May 15;255:121499. doi: 10.1016/j.watres.2024.121499. Epub 2024 Mar 20.
Water quality index (WQI) models that use data-driven approaches, especially those integrating machine learning and artificial intelligence (ML/AI) techniques, have advanced significantly in recent years. However, several recent studies have shown that data-driven models can produce inconsistent results because of data outliers, which significantly affect model reliability and accuracy. The present study assessed the impact of data outliers on the recently developed Irish Water Quality Index (IEWQI) model, which relies on data-driven techniques. To the best of the authors' knowledge, no systematic framework has existed for evaluating the influence of data outliers on such models. This study is the first to introduce a comprehensive approach that combines machine learning with advanced statistical techniques to assess the impact of data outliers on a water quality (WQ) model. The proposed framework was implemented in Cork Harbour, Ireland, to evaluate the sensitivity of the IEWQI model to outliers in the input indicators used to assess water quality. To detect outliers in the dataset, the study applied two widely used ML techniques, Isolation Forest (IF) and Kernel Density Estimation (KDE), and then predicted WQ with and without the detected outliers. Five commonly used statistical measures were used to validate the ML results. The performance metric (R) indicates that model performance in predicting WQ improved slightly after the outliers were removed from the input (R increased from 0.92 to 0.95). However, the IEWQI scores showed no statistically significant differences among the actual values, the predictions with outliers, and the predictions without outliers at the 95 % confidence level (significance threshold of p < 0.05). 
The model uncertainty results also showed that the model contributed <1 % uncertainty to the final assessment for both datasets (with and without outliers). In addition, all statistical measures indicated that the ML techniques provided reliable results that can be used to detect outliers and assess their impact on the IEWQI model. The findings show that, although the data outliers had no significant impact on the IEWQI model architecture, they had a moderate impact on the model's rating schemes. This indicates that detecting data outliers could improve the accuracy of the IEWQI model in rating WQ and help mitigate the model eclipsing problem. The results also provide evidence of how data outliers influence the predictions and reliability of a data-driven model; in particular, the study confirmed that the IEWQI model could rate WQ accurately despite the presence of outliers in the input. This robustness may be due to the spatio-temporal variability inherent in WQ indicators. Beyond assessing the influence of input outliers on the IEWQI model, the research underscores important areas for future investigation. These include extending the temporal analysis with multi-year data, examining spatial outlier patterns, and evaluating detection methods. It is also essential to explore the real-world impacts of revised rating categories, involve stakeholders in outlier management, and fine-tune model parameters. Analysing model performance across varying temporal and spatial resolutions and incorporating additional environmental data can significantly enhance the accuracy of WQ assessment. This study therefore offers valuable insights for strengthening the robustness of the IEWQI model and suggests avenues for enhancing its utility in broader WQ assessment applications. 
Moreover, the study successfully applied the framework for evaluating how input outliers affect a data-driven model such as the IEWQI model. The study was limited to Cork Harbour and used only a single year of WQ data. The framework should therefore be tested across other domains to evaluate how the IEWQI model responds to their spatio-temporal resolution. The study also recommends that future research adjust or revise the IEWQI model's rating schemes and investigate the practical effects of data outliers on the updated rating categories. Overall, the study provides recommendations for enhancing the adaptability of the IEWQI model and demonstrates its potential for wider use in general WQ assessment scenarios.
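
The abstract names Kernel Density Estimation (KDE) as one of the two outlier detectors but gives no configuration, so the following is only a minimal sketch of the general idea, not the study's implementation. It assumes a Gaussian kernel, a Silverman rule-of-thumb bandwidth, and a fixed fraction of lowest-density points to flag; the function names, the readings, and the 10 % fraction are all hypothetical. A small Pearson R helper is included because the abstract reports R with and without outliers.

```python
import math
import statistics


def silverman_bandwidth(values):
    """Rule-of-thumb bandwidth (an assumption; the abstract does not state one)."""
    return 1.06 * statistics.stdev(values) * len(values) ** (-1 / 5)


def gaussian_kde_density(values, bandwidth):
    """Gaussian kernel density estimate evaluated at each point of `values`."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in values)
        for x in values
    ]


def flag_low_density(values, fraction=0.1):
    """Flag the `fraction` of points with the lowest estimated density."""
    densities = gaussian_kde_density(values, silverman_bandwidth(values))
    n_flag = max(1, int(round(fraction * len(values))))
    lowest = sorted(range(len(values)), key=lambda i: densities[i])[:n_flag]
    return [i in lowest for i in range(len(values))]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical readings for a single WQ indicator; not data from the study.
readings = [7.8, 8.1, 7.9, 8.3, 8.0, 7.7, 8.2, 21.5, 7.9, 8.1]
flags = flag_low_density(readings, fraction=0.1)
cleaned = [x for x, is_outlier in zip(readings, flags) if not is_outlier]
print("flagged:", [x for x, f in zip(readings, flags) if f])
print("kept:", cleaned)
```

In a with/without comparison like the one described above, `pearson_r` would be computed once on predictions from the full input and once on predictions from the `cleaned` input. Flagging a fixed fraction is a design shortcut for illustration; a density threshold chosen from the data would be an equally plausible choice.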