Forstmeier Wolfgang, Schielzeth Holger
Behav Ecol Sociobiol. 2011 Jan;65(1):47-55. doi: 10.1007/s00265-010-1038-5. Epub 2010 Aug 19.
Fitting generalised linear models (GLMs) with more than one predictor has become the standard method of analysis in evolutionary and behavioural research. Often, GLMs are used for exploratory data analysis, where one starts with a complex full model including interaction terms and then simplifies by removing non-significant terms. While this approach can be useful, it is problematic if significant effects are interpreted as if they arose from a single a priori hypothesis test. This is because model selection involves cryptic multiple hypothesis testing, a fact that has only rarely been acknowledged or quantified. We show that the probability of finding at least one 'significant' effect is high, even if all null hypotheses are true (e.g. 40% when starting with four predictors and their two-way interactions). This probability is close to theoretical expectations when the sample size (N) is large relative to the number of predictors including interactions (k). In contrast, type I error rates strongly exceed even those expectations when model simplification is applied to models that are over-fitted before simplification (low N/k ratio). The increase in false-positive results arises primarily from an overestimation of effect sizes among significant predictors, leading to upward-biased effect sizes that often cannot be reproduced in follow-up studies ('the winner's curse'). Despite having their own problems, full model tests and P value adjustments can be used as a guide to how frequently type I errors arise by sampling variation alone. We favour the presentation of full models, since they best reflect the range of predictors investigated and ensure a balanced representation also of non-significant results.
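The 40% figure quoted above can be checked with a short calculation. This is a minimal sketch, not the authors' simulation code. It assumes that each term is tested independently at a nominal alpha of 0.05, so the chance of at least one false positive among k tests is 1 - (1 - alpha)^k. The abstract does not state this formula or the alpha level, and the function names are illustrative only.

```python
from math import comb


def count_terms(n_predictors: int) -> int:
    """Number of tested terms: main effects plus all two-way interactions."""
    return n_predictors + comb(n_predictors, 2)


def familywise_error_rate(k: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among k independent tests.

    Assumption: the k tests are independent and every null hypothesis is true.
    """
    return 1.0 - (1.0 - alpha) ** k


def sidak_alpha(k: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the familywise rate at alpha (Sidak)."""
    return 1.0 - (1.0 - alpha) ** (1.0 / k)


k = count_terms(4)                # 4 main effects + 6 interactions = 10
fwer = familywise_error_rate(k)   # about 0.40 under the stated assumptions
per_test = sidak_alpha(k)         # adjusted per-test threshold

print(f"terms tested: {k}")
print(f"familywise type I error rate: {fwer:.3f}")
print(f"Sidak-adjusted per-test alpha: {per_test:.4f}")
```

With four predictors the sketch counts ten tested terms and gives a familywise rate of roughly 0.40, in line with the abstract's example. The Sidak threshold is one example of the kind of P value adjustment the abstract mentions. For over-fitted models (low N/k) the abstract reports rates above this expectation, which a closed-form calculation like this one does not capture.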