Geography and Environment, School of Geography, University of Southampton, Southampton, SO17 1BJ, United Kingdom.
CVRM-, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais, 1049-001 Lisboa, Portugal.
Sci Total Environ. 2014 Apr 1;476-477:189-206. doi: 10.1016/j.scitotenv.2014.01.001. Epub 2014 Jan 24.
Watershed management decisions need robust methods that allow accurate predictive modeling of pollutant occurrences. Random Forest (RF) is a powerful data-driven machine learning method that is rarely used in water resources studies, and thus has not been evaluated thoroughly in this field when compared to more conventional pattern recognition techniques. Key advantages of RF include its non-parametric nature, high predictive accuracy, and capability to determine variable importance. This last characteristic can be used to better understand the individual role and the combined effect of explanatory variables in protecting groundwater from, or exposing it to, a pollutant. In this paper, the performance of RF regression for predictive modeling of nitrate pollution is explored, based on intrinsic and specific vulnerability assessment of the Vega de Granada aquifer. The applicability of this new machine learning technique is demonstrated in an agriculture-dominated area where nitrate concentrations in groundwater can exceed the trigger value of 50 mg/L at many locations. A comprehensive GIS database of twenty-four parameters related to intrinsic hydrogeologic properties, driving forces, remotely sensed variables and physical-chemical variables measured in situ was used as input to build different predictive models of nitrate pollution. RF measures of importance were also used to define the most significant predictors of nitrate pollution in groundwater, allowing the pollution sources (pressures) to be identified. The potential of RF for generating a vulnerability map to nitrate pollution is assessed considering multiple criteria related to variations in the algorithm parameters and the accuracy of the maps. The performance of RF is also evaluated in comparison to the logistic regression (LR) method, using different efficiency measures to check the generalization ability of both methods.
Prediction results show the ability of RF to build accurate models with strong predictive capabilities.
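The workflow summarized above (RF regression on a set of predictors, ranking of variable importance, and a comparison with LR on exceedance of the 50 mg/L trigger value) can be illustrated with a minimal sketch. It is not the authors' implementation: the use of scikit-learn, the synthetic data, the hyperparameters, the train/test split and the choice of ROC AUC as the efficiency measure are all assumptions made for illustration. Only the number of predictors (twenty-four) and the 50 mg/L threshold come from the abstract.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the GIS database: 24 predictors per sampling point.
# The data are made up; only predictors 0 and 1 influence the response.
rng = np.random.default_rng(0)
n_samples, n_features = 400, 24
X = rng.normal(size=(n_samples, n_features))
nitrate = (
    50.0
    + 20.0 * X[:, 0]
    + 10.0 * np.maximum(X[:, 1], 0.0)  # non-linear effect of predictor 1
    + rng.normal(scale=5.0, size=n_samples)
)  # hypothetical nitrate concentration in mg/L

TRIGGER_MG_L = 50.0  # trigger value quoted in the abstract
exceeds = (nitrate > TRIGGER_MG_L).astype(int)

X_tr, X_te, y_tr, y_te, c_tr, c_te = train_test_split(
    X, nitrate, exceeds, test_size=0.3, random_state=0
)

# RF regression on the continuous concentration (hyperparameters are assumed).
rf = RandomForestRegressor(n_estimators=300, oob_score=True, random_state=0)
rf.fit(X_tr, y_tr)

# Impurity-based variable importance, ranked from most to least important.
importances = rf.feature_importances_
ranking = np.argsort(importances)[::-1]

# LR baseline on the binary exceedance label.
lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
lr.fit(X_tr, c_tr)

# Compare both models on held-out data with one efficiency measure (ROC AUC).
# The RF predicted concentration is used directly as the ranking score.
rf_auc = roc_auc_score(c_te, rf.predict(X_te))
lr_auc = roc_auc_score(c_te, lr.predict_proba(X_te)[:, 1])

print("Top 5 predictors by RF importance:", ranking[:5])
print(f"RF out-of-bag R^2: {rf.oob_score_:.2f}")
print(f"Held-out ROC AUC  RF: {rf_auc:.2f}  LR: {lr_auc:.2f}")
```

In a real application the predictors would be spatially referenced, so predicting over a raster of the same variables would yield a vulnerability map, and a spatially aware validation scheme would probably be preferable to the random split used here.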