Gu Tengfei, Duan Ping, Wang Mingguo, Li Jia, Zhang Yanke
Faculty of Geography, Yunnan Normal University, Kunming, 650500, China.
Badong National Observation and Research Station of Geohazards, China University of Geosciences (Wuhan), Wuhan, 430074, China.
Sci Rep. 2024 Mar 26;14(1):7201. doi: 10.1038/s41598-024-57964-5.
This study explores how different non-landslide sampling strategies affect machine learning models in landslide susceptibility mapping. Non-landslide samples are inherently uncertain, and their selection may suffer from noise or insufficient regional representation, either of which can reduce the accuracy of the results. In this study, a positive-unlabeled (PU) bagging semi-supervised learning method was introduced for non-landslide sample selection, and buffer control sampling (BCS) and K-means (KM) clustering were applied for comparison. Based on landslide data collected in 2014 from Qiaojia County, Yunnan Province, China, three machine learning models (random forest, support vector machine, and CatBoost) were used for landslide susceptibility mapping. The results show that the quality of the selected samples varies significantly across non-landslide sampling strategies. Overall, the non-landslide samples selected using the PU bagging method are of the highest quality, and this method performs best when combined with CatBoost (AUC = 0.897), with 82.14% of landslides predicted in the very high and high susceptibility zones. The KM results indicated overfitting: validation accuracy was high, but the zoning statistics were poor. The BCS results were the worst.
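The idea behind PU bagging for non-landslide sample selection can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the two-dimensional toy features, and the nearest-centroid base scorer are all assumptions chosen to keep the example dependency-free. Each round treats a bootstrap draw of unlabeled cells as provisional negatives, scores the out-of-bag unlabeled cells, and the averaged scores are then used to pick the cells least similar to known landslides.

```python
import random


def centroid(rows):
    """Column-wise mean of a non-empty list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pu_bagging_scores(positives, unlabeled, n_rounds=50, seed=0):
    """Average out-of-bag landslide-likeness score (0..1) per unlabeled sample.

    Assumption: a nearest-centroid scorer stands in for the base classifier.
    """
    rng = random.Random(seed)
    bag_size = min(len(positives), len(unlabeled))
    pos_c = centroid(positives)
    totals = [0.0] * len(unlabeled)
    counts = [0] * len(unlabeled)
    for _ in range(n_rounds):
        # Bootstrap draw (with replacement) of provisional negatives.
        bag = [rng.randrange(len(unlabeled)) for _ in range(bag_size)]
        in_bag = set(bag)
        neg_c = centroid([unlabeled[i] for i in bag])
        # Score only the samples left out of this round's bag.
        for i, x in enumerate(unlabeled):
            if i in in_bag:
                continue
            d_pos, d_neg = sq_dist(x, pos_c), sq_dist(x, neg_c)
            total = d_pos + d_neg
            totals[i] += d_neg / total if total > 0 else 0.5
            counts[i] += 1
    # None marks a sample that was never out of bag.
    return [t / c if c else None for t, c in zip(totals, counts)]


def select_non_landslides(scores, n_select):
    """Indices of the n_select unlabeled samples with the lowest scores."""
    scored = [i for i, s in enumerate(scores) if s is not None]
    return sorted(scored, key=lambda i: scores[i])[:n_select]


# Hypothetical toy data: known landslides cluster near (5, 5); the unlabeled
# pool holds 20 stable cells near the origin plus 10 undetected landslides.
positives = [[5 + 0.1 * i, 5 - 0.1 * i] for i in range(10)]
stable = [[0.1 * i, 0.05 * i] for i in range(20)]
hidden = [[5 - 0.1 * i, 5 + 0.1 * i] for i in range(10)]
unlabeled = stable + hidden

scores = pu_bagging_scores(positives, unlabeled)
chosen = select_non_landslides(scores, n_select=10)
```

In a real workflow the feature vectors would be conditioning factors such as slope or lithology, and a stronger base learner would likely replace the centroid scorer; the selected indices would then serve as the negative class when training the susceptibility model.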