Predicting vector distribution in Europe: at what sample size are species distribution models reliable?

Author information

Mitchel Lianne, Hendrickx Guy, MacLeod Ewan T, Marsboom Cedric

Affiliations

Deanery of Biomedical Sciences, College of Medicine and Veterinary Medicine, University of Edinburgh, Edinburgh, United Kingdom.

UK Health Security Agency (UKHSA), Bristol, United Kingdom.

Publication information

Front Vet Sci. 2025 May 29;12:1584864. doi: 10.3389/fvets.2025.1584864. eCollection 2025.

DOI:10.3389/fvets.2025.1584864

PMID:40510377

Original link: https://pmc.ncbi.nlm.nih.gov/articles/PMC12159067/

Abstract

INTRODUCTION

Species distribution models can predict the spatial distribution of vector-borne diseases by forming associations between known vector distribution and environmental variables. In response to a changing climate and increasing rates of vector-borne diseases in Europe, model predictions for vector distribution can be used to improve surveillance. However, the field lacks standardisation, and there is little consensus as to what sample size produces reliable models.

OBJECTIVE

To determine the optimum sample size for models developed with the machine learning algorithm Random Forest under different sample ratios.

MATERIALS AND METHODS

To overcome the limitations of real vector data, a simulated vector with a fully known distribution across 10 test sites in Europe was used to randomly generate samples of different sizes. The test sites accounted for varying habitat suitability and the vector's relative occurrence area. In total, 9,000 Random Forest models were developed using 24 sample sizes (ranging from 10 to 5,000) and three sample ratios of presence to absence data (50:50, 20:80, and 40:60). Model performance was evaluated using five metrics: percentage correctly classified, sensitivity, specificity, Cohen's Kappa, and Area Under the Curve. The metrics were grouped by sample size and ratio. The optimum sample size was identified as the point at which the 25th percentile met the thresholds for excellent performance, defined as 0.605-0.804 for Cohen's Kappa and 0.795-0.894 for the remaining metrics (to three decimal places).
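The evaluation rule described above can be illustrated with a minimal sketch. It is not the authors' code. It assumes binary presence (1) and absence (0) labels, reads "met thresholds" as "the 25th percentile reaches the lower bound of the excellent band for all five metrics", and takes the smallest qualifying sample size as the optimum. The function names, the `results` layout, and the percentile interpolation method are all hypothetical choices made for illustration.

```python
from statistics import quantiles

# Lower bounds of the "excellent" bands quoted in the abstract.
THRESHOLDS = {"kappa": 0.605, "pcc": 0.795, "sensitivity": 0.795,
              "specificity": 0.795, "auc": 0.795}


def confusion_metrics(y_true, y_pred):
    """PCC, sensitivity, specificity and Cohen's Kappa for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = tp + tn + fp + fn
    pcc = (tp + tn) / n                      # observed agreement
    # Agreement expected by chance, from the row and column totals.
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return {"pcc": pcc,
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "kappa": (pcc - pe) / (1 - pe)}


def auc(y_true, scores):
    """Area Under the ROC Curve via pairwise ranking (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def percentile_25(values):
    """First quartile; the interpolation method is an assumption."""
    return quantiles(values, n=4, method="inclusive")[0]


def optimum_sample_size(results):
    """results maps sample_size -> {metric name: [value per model]}.

    Returns the smallest sample size whose 25th percentile reaches the
    lower threshold for every metric, or None if no size qualifies.
    """
    for size in sorted(results):
        if all(percentile_25(results[size][m]) >= t
               for m, t in THRESHOLDS.items()):
            return size
    return None
```

Using the 25th percentile rather than the median makes the rule conservative: at the chosen sample size, roughly three quarters of the fitted models reach the threshold on each metric.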

RESULTS

For balanced sample ratios, the optimum sample size for reliable models fell within the range of 750-1,000. Estimates increased to 1,100-1,300 for unbalanced samples with a 40:60 ratio of presence to absence data. By contrast, unbalanced samples with a 20:80 ratio of presence to absence data did not produce reliable models at any of the sample sizes considered.
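The sample ratios compared above fix how many presence and absence records each sample contains. The sketch below draws such a sample from a fully known distribution. It is an illustration only, and `draw_sample`, its parameters, and the cell identifiers are hypothetical, not taken from the study.

```python
import random


def draw_sample(presence_cells, absence_cells, sample_size, ratio, seed=None):
    """Randomly draw a sample with a fixed presence:absence ratio.

    ratio is a (presence, absence) pair such as (20, 80). Cells are drawn
    without replacement from the known presence and absence locations.
    """
    rng = random.Random(seed)
    presence_part, absence_part = ratio
    n_presence = sample_size * presence_part // (presence_part + absence_part)
    n_absence = sample_size - n_presence
    return (rng.sample(presence_cells, n_presence),
            rng.sample(absence_cells, n_absence))


# Hypothetical site: cell ids 0-999 are presences, 1000-4999 are absences.
presences, absences = draw_sample(list(range(1000)),
                                  list(range(1000, 5000)),
                                  sample_size=1000, ratio=(20, 80), seed=1)
print(len(presences), len(absences))  # 200 800
```

At a sample size of 1,000, a 20:80 ratio supplies 200 presence records, compared with 400 at 40:60 and 500 at 50:50.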

CONCLUSION

To our knowledge, this is the first study to use a simulated vector to identify the optimum sample size for Random Forest models at this resolution (≤1 km) and extent (≥10,000 km). These results may improve the reliability of model predictions, optimise field sampling, and enhance vector surveillance in response to changing climates. Further research may seek to refine these estimates and confirm transferability to real vectors.
