Department of Environmental Sciences, Zoology, University of Basel, Basel, Switzerland.
J Evol Biol. 2022 Jun;35(6):777-787. doi: 10.1111/jeb.14009. Epub 2022 May 18.
A paradigm shift away from null hypothesis significance testing seems to be in progress. Based on simulations, we illustrate some of the underlying motivations. First, p-values vary strongly from study to study, hence dichotomous inference using significance thresholds is usually unjustified. Second, 'statistically significant' results have overestimated effect sizes, a bias declining with increasing statistical power. Third, 'statistically non-significant' results have underestimated effect sizes, and this bias gets stronger with higher statistical power. Fourth, the tested statistical hypotheses usually lack biological justification and are often uninformative. Despite these problems, a screen of 48 papers from the 2020 volume of the Journal of Evolutionary Biology exemplifies that significance testing is still used almost universally in evolutionary biology. All screened studies tested default null hypotheses of zero effect with the default significance threshold of p = 0.05; none presented a pre-specified alternative hypothesis, a pre-study power calculation or the probability of 'false negatives' (beta error rate). The results sections of the papers presented 49 significance tests on average (median 23, range 0-390). Of 41 studies that contained verbal descriptions of a 'statistically non-significant' result, 26 (63%) falsely claimed the absence of an effect. We conclude that studies in ecology and evolutionary biology are mostly exploratory and descriptive. We should thus shift from claiming to 'test' specific hypotheses statistically to describing and discussing many hypotheses (possible true effect sizes) that are most compatible with our data, given our statistical model. We already have the means for doing so, because we routinely present compatibility ('confidence') intervals covering these hypotheses.
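The first three points of the abstract can be reproduced with a small simulation. The sketch below is not the simulation used in the paper, whose details the abstract does not give. It is a minimal illustration under assumptions chosen here: a two-group comparison with known standard deviation 1 (so a z-test applies), a true difference of 0.5, a two-sided threshold of 0.05, and a hypothetical function name `simulate`. Each simulated study draws its estimated difference from the sampling distribution, and the estimates are then averaged separately for 'significant' and 'non-significant' results.

```python
import math
import random
import statistics


def simulate(true_effect, n_per_group, n_studies=5000, alpha=0.05, seed=1):
    """Simulate many two-group studies analysed with a z-test (SD assumed known, = 1)."""
    rng = random.Random(seed)
    # Standard error of a difference between two means, each based on n_per_group values.
    se = math.sqrt(2.0 / n_per_group)
    sig, nonsig, pvals = [], [], []
    for _ in range(n_studies):
        # Draw the estimated difference directly from its sampling distribution.
        est = rng.gauss(true_effect, se)
        z = est / se
        # Two-sided p-value under the null hypothesis of zero difference.
        p = math.erfc(abs(z) / math.sqrt(2.0))
        pvals.append(p)
        (sig if p < alpha else nonsig).append(est)
    return {
        # Share of studies below the threshold, i.e. the simulated power.
        "power": len(sig) / n_studies,
        "mean_sig": statistics.fmean(sig) if sig else float("nan"),
        "mean_nonsig": statistics.fmean(nonsig) if nonsig else float("nan"),
        # Three quartile cut points of the p-values across studies.
        "p_quartiles": statistics.quantiles(pvals, n=4),
    }


if __name__ == "__main__":
    for n in (10, 50, 100):
        r = simulate(true_effect=0.5, n_per_group=n)
        print(
            f"n={n:3d} power={r['power']:.2f} "
            f"mean|sig={r['mean_sig']:.2f} mean|nonsig={r['mean_nonsig']:.2f} "
            f"p quartiles={[round(q, 3) for q in r['p_quartiles']]}"
        )
```

Under these assumptions, the mean estimate among 'significant' studies lies above the true value of 0.5 and approaches it as the sample size (and hence power) grows, whereas the mean among 'non-significant' studies lies below 0.5 and moves further away with higher power. The quartiles of the p-values show how widely p varies across replicate studies of the same true effect.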