Ter Braak Cajo J F, Peres-Neto Pedro, Dray Stéphane
Biometris, Wageningen University & Research, Wageningen, The Netherlands.
Department of Biology, Concordia University, Montreal, Canada.
PeerJ. 2017 Jan 12;5:e2885. doi: 10.7717/peerj.2885. eCollection 2017.
Statistical testing of trait-environment association from data is a challenge because there is no common unit of observation: the trait is observed on species, the environment on sites, and the mediating abundance on species-site combinations. A number of correlation-based methods, such as the community weighted trait means method (CWM), the fourth-corner correlation method and the multivariate method RLQ, have been proposed to estimate such trait-environment associations. In these methods, valid statistical testing proceeds by performing two separate resampling tests, one site-based and the other species-based, and by assessing significance by the largest of the two p-values (the p_max test). Recently, regression-based methods using generalized linear models (GLM) have been proposed as a promising alternative, with statistical inference via site-based resampling. We investigated the performance of this new approach along with approaches that mimicked the p_max test using GLM instead of fourth-corner. By simulation using models with additional random variation in the species response to the environment, the site-based resampling tests using GLM are shown to have severely inflated type I error, of up to 90%, when the nominal level is set at 5%. In addition, predictive modelling of such data using site-based cross-validation very often identified trait-environment interactions that had no predictive value. The problem that we identify is not an "omitted variable bias" problem, as it occurs even when the additional random variation is independent of the observed trait and environment data. Instead, it is a problem of ignoring a random effect. In the same simulations, the GLM-based p_max test controlled the type I error in all models proposed so far in this context, but still gave slightly inflated error in more complex models that included both missing (but important) traits and missing (but important) environmental variables.
For screening the importance of single trait-environment combinations, the fourth-corner test is shown to give almost the same results as the GLM-based tests in far less computing time.
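The two-permutation logic described above (a site-based test, a species-based test, and significance judged by the larger of the two p-values) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `fourth_corner_r` and `pmax_test`, the use of the absolute abundance-weighted correlation as the test statistic, and the choice of permuting environment values across sites and trait values across species are all assumptions made for illustration.

```python
import numpy as np

def fourth_corner_r(L, e, t):
    """Abundance-weighted correlation between a site variable e and a species trait t.

    L is a sites x species abundance table (assumed non-negative, not all zero).
    """
    P = np.asarray(L, dtype=float)
    P = P / P.sum()                      # joint weights over site-species cells
    e = np.asarray(e, dtype=float)
    t = np.asarray(t, dtype=float)
    w_site = P.sum(axis=1)               # site weights (row totals)
    w_species = P.sum(axis=0)            # species weights (column totals)
    e_c = e - w_site @ e                 # centre e by its weighted mean
    t_c = t - w_species @ t              # centre t by its weighted mean
    sd_e = np.sqrt(w_site @ e_c**2)
    sd_t = np.sqrt(w_species @ t_c**2)
    return float(e_c @ P @ t_c / (sd_e * sd_t))

def pmax_test(L, e, t, n_perm=999, seed=0):
    """Hypothetical helper: site-based and species-based permutation tests,
    combined by taking the larger p-value."""
    rng = np.random.default_rng(seed)
    r_obs = fourth_corner_r(L, e, t)
    observed = abs(r_obs)
    # Site-based test: shuffle environment values across sites.
    site_null = [abs(fourth_corner_r(L, rng.permutation(e), t)) for _ in range(n_perm)]
    # Species-based test: shuffle trait values across species.
    species_null = [abs(fourth_corner_r(L, e, rng.permutation(t))) for _ in range(n_perm)]
    # The observed statistic counts as one member of each reference set.
    p_site = (1 + sum(r >= observed for r in site_null)) / (n_perm + 1)
    p_species = (1 + sum(r >= observed for r in species_null)) / (n_perm + 1)
    return {"r": r_obs, "p_site": p_site, "p_species": p_species,
            "p_max": max(p_site, p_species)}
```

Calling `pmax_test(L, e, t)` with an abundance table `L` (sites in rows, species in columns), a vector `e` of site environment values and a vector `t` of species trait values returns both p-values and their maximum. Under this scheme the association is declared significant only if both tests reject.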