School of Environment, Faculty of Science, University of Auckland, New Zealand.
Water Res. 2020 Jun 15;177:115788. doi: 10.1016/j.watres.2020.115788. Epub 2020 Apr 13.
Predicting recreational water quality is one of the most difficult tasks in water management, with major implications for humans and society. Many data-driven models have been used to predict water quality indicators to allow a real-time assessment of public health risk. This assessment is most commonly based on Faecal Indicator Bacteria (FIB), with the FIB value compared against thresholds published in guidelines. However, FIB values tend to be imbalanced within water quality datasets: a small proportion of the data exceeds guideline thresholds and a far larger proportion does not. This can limit the uptake of model predictions since, even if the overall accuracy is high, the sensitivity of the predictions can be low. To address this issue, this paper proposes using the adaptive synthetic sampling algorithm (ADASYN) to generate synthetic above-threshold FIB instances and tests the validity of the approach for the prediction of recreational water quality. The models in this paper are based on four machine learning techniques (k-mean nearest neighbour, boosting decision tree, support vector machine, and multi-layer perceptron artificial neural network) and are applied to five different locations in Auckland, New Zealand. Aside from the support vector machine, all models provide favourable predictions with relatively high sensitivity (around 75%) and overall accuracy (over 90%), indicating that both compliant and exceedance conditions can be effectively predicted through more sophisticated model training that involves artificial data. Considering model accuracy and stability, the boosting decision tree (BDT) and the multi-layer perceptron artificial neural network (MLP-ANN) are the two best models, and the multi-layer perceptron is the most efficient, with the shortest computation time.
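The oversampling idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy data, the label convention (1 for an exceedance, 0 for a compliant sample) and the uniform fallback when no minority point has majority neighbours are all assumptions made for illustration. It follows the general ADASYN recipe, in which minority points surrounded by more majority neighbours receive more synthetic samples, and it shows why a model can score high accuracy with zero sensitivity on imbalanced data.

```python
import math
import random


def adasyn(X, y, k=5, beta=1.0, seed=0):
    """Return synthetic minority-class (label 1) points for a binary dataset."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == 1]
    n_major = len(y) - len(minority_idx)
    # Number of synthetic points needed to reach the requested balance level.
    n_needed = int((n_major - len(minority_idx)) * beta)
    if len(minority_idx) < 2 or n_needed <= 0:
        return []

    def neighbours(i, candidates):
        # The k closest candidates to point i, excluding i itself.
        others = [j for j in candidates if j != i]
        return sorted(others, key=lambda j: math.dist(X[i], X[j]))[:k]

    # For each minority point: share of majority points among its neighbours.
    ratios = []
    for i in minority_idx:
        nn = neighbours(i, range(len(X)))
        ratios.append(sum(y[j] == 0 for j in nn) / len(nn))
    total = sum(ratios)
    if total == 0:
        # Assumption: spread samples evenly if no point is hard to learn.
        weights = [1 / len(ratios)] * len(ratios)
    else:
        weights = [r / total for r in ratios]

    synthetic = []
    for i, w in zip(minority_idx, weights):
        minority_nn = neighbours(i, minority_idx)
        for _ in range(round(w * n_needed)):
            j = rng.choice(minority_nn)
            lam = rng.random()
            # Interpolate between the point and a random minority neighbour.
            synthetic.append([a + lam * (b - a) for a, b in zip(X[i], X[j])])
    return synthetic


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def sensitivity(y_true, y_pred):
    # Fraction of true exceedances that were predicted as exceedances.
    hits = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    return hits / sum(t == 1 for t in y_true)


# Hypothetical toy data: 20 compliant samples and 4 exceedances, 2 features.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(20)]
X += [[data_rng.gauss(2, 0.5), data_rng.gauss(2, 0.5)] for _ in range(4)]
y = [0] * 20 + [1] * 4

# A model that always predicts "compliant" looks accurate but misses every
# exceedance, which is the problem oversampling is meant to reduce.
always_compliant = [0] * len(y)
baseline_accuracy = accuracy(y, always_compliant)
baseline_sensitivity = sensitivity(y, always_compliant)

new_points = adasyn(X, y, k=3)
X_balanced = X + new_points
y_balanced = y + [1] * len(new_points)
```

In practice the synthetic points would be added to the training set only, so that sensitivity and accuracy are still measured on observed samples.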