Guangdong Province Key Laboratory for Land Use and Consolidation, South China Agricultural University, Guangzhou 510642, China; College of Natural Resources and Environment, Joint Institute for Environment & Education, South China Agricultural University, Guangzhou 510642, China.
College of Natural Resources and Environment, Joint Institute for Environment & Education, South China Agricultural University, Guangzhou 510642, China.
J Environ Manage. 2023 Oct 15;344:118682. doi: 10.1016/j.jenvman.2023.118682. Epub 2023 Aug 9.
Machine learning (ML)-based studies of urban waterlogging susceptibility suffer from class imbalance, because fewer positive samples are generally available than potential negative samples. Few studies have tried to optimize the results by improving the quality of the training samples. To address this issue, we explored approaches for reliably increasing the number of positive samples in such studies. Two methods were employed: the Synthetic Minority Over-Sampling Technique (SMOTE), representing oversampling approaches that synthesize new samples in the feature space, and the Optimized Seed Spread Algorithm (OSSA), representing physical approaches that simulate the potentially inundated area based on the mechanisms of water flow. Waterlogging in Shenzhen served as the case study, using eight spatial variables. An experiment was conducted to compare the quality of the added samples in terms of classifier performance and the accuracy of the resulting waterlogging susceptibility maps (WSMs). The results indicated that (1) classifiers trained with SMOTE-generated samples performed worse than those trained on the original samples, whereas OSSA improved the trained classifiers, and (2) the accuracy of the WSMs was not improved with SMOTE but increased markedly with OSSA. These results may be driven by the diversity of information and features of the added samples. This study indicates that SMOTE fails to synthesize reliable samples when applied to waterlogging analysis in Shenzhen, whereas OSSA, which simulates the potentially submerged regions based on the mechanisms of disaster occurrence and spread, is an effective way to generate reliable positive samples.
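For readers unfamiliar with feature-space oversampling, the SMOTE idea referred to in the abstract can be sketched roughly as follows. This is a minimal illustration of the generic technique, not the authors' implementation: the function name `smote`, its parameters, and the random eight-column array standing in for the eight spatial variables are all assumptions made for the example. OSSA is not sketched because the abstract does not describe its steps.

```python
import numpy as np

def smote(minority, n_new, k=5, seed=0):
    """Synthesize n_new samples by interpolating between minority-class neighbours.

    Hypothetical helper for illustration only; names and defaults are assumptions.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(minority, dtype=float)
    n = len(X)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)

    # Pairwise Euclidean distances within the minority class.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # For each synthetic sample: pick a base point, pick one of its k nearest
    # neighbours, and place the new point at a random position between them.
    base = rng.integers(0, n, size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X[base] + gap * (X[picked] - X[base])

# Made-up data: 20 positive samples described by 8 variables.
positives = np.random.default_rng(1).random((20, 8))
synthetic = smote(positives, n_new=30)
```

Every synthetic point lies on a line segment between two existing positive samples, so the added samples stay inside the range of feature values already observed. That property is at least consistent with the abstract's suggestion that the diversity of information in the added samples may explain the results.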