AT&T Labs-Research, 180 Park Avenue, Florham Park, New Jersey 07932, USA.
Ecology. 2013 Jun;94(6):1409-19. doi: 10.1890/12-1520.1.
A fundamental ecological modeling task is to estimate the probability that a species is present in (or uses) a site, conditional on environmental variables. For many species, available data consist of "presence" data (locations where the species [or evidence of it] has been observed), together with "background" data, a random sample of available environmental conditions. Recently published papers disagree on whether probability of presence is identifiable from such presence-background data alone. This paper aims to resolve the disagreement, demonstrating that additional information is required. We defined seven simulated species representing various simple shapes of response to environmental variables (constant, linear, convex, unimodal, S-shaped) and ran five logistic model-fitting methods using 1000 presence samples and 10 000 background samples; the simulations were repeated 100 times. The experiment revealed a stark contrast between two groups of methods: those based on a strong assumption that species' true probability of presence exactly matches a given parametric form had highly variable predictions and much larger RMS error than methods that take population prevalence (the fraction of sites in which the species is present) as an additional parameter. For six species, the former group grossly under- or overestimated probability of presence. The cause was not model structure or choice of link function, because all methods were logistic with linear and, where necessary, quadratic terms. Rather, the experiment demonstrates that an estimate of prevalence is not just helpful, but is necessary (except in special cases) for identifying probability of presence. We therefore advise against use of methods that rely on the strong assumption, due to Lele and Keim (recently advocated by Royle et al.) and Lancaster and Imbens. The methods are fragile, and their strong assumption is unlikely to be true in practice. 
We emphasize, however, that we are not arguing against standard statistical methods such as logistic regression, generalized linear models, and so forth, none of which requires the strong assumption. If probability of presence is required for a given application, there is no panacea for lack of data. Presence-background data must be augmented with an additional datum, e.g., species' prevalence, to reliably estimate absolute (rather than relative) probability of presence.
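The identifiability point can be illustrated with a small deterministic sketch. It is not the paper's simulation: the two linear species, the 11-point environmental grid, and all function names below are hypothetical choices made for illustration. The sketch shows that two species whose probabilities of presence differ only by a constant factor produce exactly the same distribution of presence records, so presence-background data alone cannot separate them, while supplying prevalence recovers the absolute probability.

```python
# Illustrative sketch only: species, grid, and function names are assumptions,
# not taken from the paper.

def prevalence(prob_presence, background):
    """Fraction of sites occupied: sum over conditions of P(present | x) * P(x)."""
    return sum(p * b for p, b in zip(prob_presence, background))

def presence_distribution(prob_presence, background):
    """Distribution of conditions at presence sites, by Bayes' rule:
    P(x | present) = P(present | x) * P(x) / prevalence."""
    prev = prevalence(prob_presence, background)
    return [p * b / prev for p, b in zip(prob_presence, background)]

def absolute_probability(presence_dist, background, prev):
    """Invert Bayes' rule once prevalence is known:
    P(present | x) = prevalence * P(x | present) / P(x)."""
    return [prev * d / b for d, b in zip(presence_dist, background)]

# Environmental variable on an evenly spaced grid; every condition equally available.
n = 11
xs = [i / (n - 1) for i in range(n)]
background = [1 / n] * n

# Two hypothetical species with the same relative response but different prevalence.
species_a = [0.8 * x for x in xs]
species_b = [0.4 * x for x in xs]

dist_a = presence_distribution(species_a, background)
dist_b = presence_distribution(species_b, background)

# The presence data look identical for both species.
max_gap = max(abs(a - b) for a, b in zip(dist_a, dist_b))

# With prevalence supplied as the additional datum, the absolute probability is recovered.
recovered_a = absolute_probability(dist_a, background, prevalence(species_a, background))
recovered_b = absolute_probability(dist_b, background, prevalence(species_b, background))

print("largest difference between presence distributions:", max_gap)
print("prevalence A:", prevalence(species_a, background))
print("prevalence B:", prevalence(species_b, background))
```

In this toy setting the presence distribution fixes only the relative probability of presence; the scale comes entirely from the prevalence value passed to `absolute_probability`.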