Impacts of sample ratio and size on the performance of random forest model to predict the potential distribution of snail habitats.

Author information

Key Laboratory of Public Health Safety of Ministry of Education, Department of Epidemiology and Health Statistics, School of Public Health, Fudan University, Shanghai.

Sydney School of Veterinary Science, The University of Sydney, Sydney.

Publication information

Geospat Health. 2023 Jul 3;18(2). doi: 10.4081/gh.2023.1151.

Abstract

Few studies have considered the impacts of sample size and sample ratio of presence and absence points on the results of random forest (RF) testing. We applied this technique for the prediction of the spatial distribution of snail habitats based on a total of 15,000 sample points (5,000 presence samples and 10,000 control points). RF models were built using seven different sample ratios (1:1, 1:2, 1:3, 1:4, 2:1, 3:1, and 4:1) and the optimal ratio was identified via the Area Under the Curve (AUC) statistic. The impact of sample size was compared by RF models under the optimal ratio and the optimal sample size. When the sample size was small, the sampling ratios of 1:1, 1:2 and 1:3 were significantly better than the sample ratios of 4:1 and 3:1 at all four levels of sample sizes (p<0.01) and there was no significant difference among the ratios of 1:1, 1:2 and 1:3 (p>0.05). The sample ratio of 1:2 appeared to be optimal for a relatively large sample size with the lowest quartile deviation. In addition, increasing the sample size produced a higher AUC and a smaller slope and the most suitable sample size found in this study was 2400 (AUC=0.96). This study provides a feasible idea to select an appropriate sample size and sample ratio for ecological niche modelling (ENM) and also provides a scientific basis for the selection of samples to accurately identify and predict snail habitat distributions.
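The sampling design described in the abstract can be illustrated with a minimal sketch. It is not the authors' code. It assumes the ratios are written as presence:absence, that the sample size refers to the total number of points, and that points are drawn without replacement from the 5,000 presence and 10,000 control points. The names `split_counts` and `draw_sample` are hypothetical, and the random forest fitting and AUC evaluation steps are omitted.

```python
import random

# The seven ratios listed in the abstract, read here as presence:absence (assumption).
RATIOS = [(1, 1), (1, 2), (1, 3), (1, 4), (2, 1), (3, 1), (4, 1)]


def split_counts(total, ratio):
    """Return (n_presence, n_absence) for a total sample size and a ratio."""
    p, a = ratio
    n_presence = round(total * p / (p + a))
    return n_presence, total - n_presence


def draw_sample(presence_ids, absence_ids, total, ratio, seed=0):
    """Draw presence and absence points without replacement (hypothetical helper)."""
    n_p, n_a = split_counts(total, ratio)
    if n_p > len(presence_ids) or n_a > len(absence_ids):
        raise ValueError("pool too small for this ratio and sample size")
    rng = random.Random(seed)
    return rng.sample(presence_ids, n_p), rng.sample(absence_ids, n_a)


# Placeholder point IDs standing in for the 5,000 presence and 10,000 control points.
presence_pool = list(range(5000))
absence_pool = list(range(10000))

for ratio in RATIOS:
    n_p, n_a = split_counts(2400, ratio)
    print(f"{ratio[0]}:{ratio[1]} -> {n_p} presence, {n_a} absence")
```

Under these assumptions, a total of 2,400 points at a ratio of 1:2 corresponds to 800 presence and 1,600 absence points. Each drawn subset would then be passed to a random forest classifier and scored by AUC.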
