Nerea Abrego, Otso Ovaskainen
Department of Biological and Environmental Science, University of Jyväskylä, Jyväskylä, Finland.
Department of Agricultural Sciences, University of Helsinki, Helsinki, Finland.
Ecol Evol. 2023 Dec 18;13(12):e10784. doi: 10.1002/ece3.10784. eCollection 2023 Dec.
When comparing multiple models of species distribution, models yielding higher predictive performance are clearly to be favored. A more difficult question is how to decide whether even the best model is "good enough". Here, we clarify key choices and metrics related to evaluating the predictive performance of presence-absence models. We use a hierarchical case study to evaluate how four metrics of predictive performance (AUC, Tjur's R², max-Kappa, and max-TSS) relate to each other, to the random and fixed effects parts of the model, to the spatial scale at which predictive performance is measured, and to the cross-validation strategy chosen. We demonstrate that the very same metric can achieve different values for the very same model, even when similar cross-validation strategies are followed, depending on the spatial scale at which predictive performance is measured. Among the metrics, Tjur's R² and max-Kappa generally increase with species' prevalence, whereas AUC and max-TSS are largely independent of prevalence. Thus, Tjur's R² and max-Kappa often reach lower values when measured at the smallest scales considered in the study, while AUC and max-TSS reach similar values across the different spatial levels included in the study. Nevertheless, the metrics provide complementary insights on predictive performance. The very same model may appear excellent or poor not only due to the applied metric, but also due to how exactly predictive performance is calculated, calling for great caution in the interpretation of predictive performance. The most comprehensive evaluation is therefore obtained by combining measures that provide complementary insights. Instead of following simple rules of thumb or focusing on absolute values, we recommend comparing the achieved predictive performance to the researcher's own a priori expectation of how easy it is to make predictions for the question the model is used for.
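For readers who wish to compute the four metrics themselves, the sketch below shows one minimal way to do so in Python from observed presence-absence labels y and predicted occurrence probabilities p. This is not the authors' implementation; the data here are simulated for illustration, and the helper names (tjur_r2, max_over_thresholds, tss) are hypothetical. AUC and Kappa come from scikit-learn, while Tjur's R² (mean predicted probability at presences minus at absences) and TSS (sensitivity + specificity - 1) are computed directly; max-Kappa and max-TSS are obtained by scanning classification thresholds.

```python
# Minimal sketch (illustrative, not the authors' code) of the four
# presence-absence metrics discussed in the abstract.
import numpy as np
from sklearn.metrics import roc_auc_score, cohen_kappa_score

def tjur_r2(y, p):
    # Tjur's R²: mean predicted probability at presences minus at absences.
    return p[y == 1].mean() - p[y == 0].mean()

def tss(y, yhat):
    # True Skill Statistic = sensitivity + specificity - 1.
    sens = (yhat[y == 1] == 1).mean()
    spec = (yhat[y == 0] == 0).mean()
    return sens + spec - 1.0

def max_over_thresholds(y, p, stat):
    # max-Kappa / max-TSS: maximize the statistic over candidate thresholds.
    return max(stat(y, (p >= t).astype(int)) for t in np.unique(p))

# Simulated (hypothetical) data: 0/1 observations and predicted probabilities.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
p = np.clip(0.3 * y + rng.uniform(0.0, 0.7, 200), 0.0, 1.0)

print("AUC      :", roc_auc_score(y, p))
print("Tjur R²  :", tjur_r2(y, p))
print("max-Kappa:", max_over_thresholds(y, p, cohen_kappa_score))
print("max-TSS  :", max_over_thresholds(y, p, tss))
```

Note that AUC and max-TSS are threshold-independent or threshold-optimized rank-based measures, which is consistent with the abstract's observation that they are largely insensitive to species prevalence, whereas Tjur's R² and Kappa depend on prevalence.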