Sanborn Adam N, Hills Thomas T
Department of Psychology, University of Warwick, Coventry, CV4 7AL, UK
Psychon Bull Rev. 2014 Apr;21(2):283-300. doi: 10.3758/s13423-013-0518-9.
Null hypothesis significance testing (NHST) is the most commonly used statistical methodology in psychology. The probability of achieving a value as extreme or more extreme than the statistic obtained from the data is evaluated, and if it is low enough, the null hypothesis is rejected. However, because common experimental practice often clashes with the assumptions underlying NHST, these calculated probabilities are often incorrect. Most commonly, experimenters use tests that assume that sample sizes are fixed in advance of data collection but then use the data to determine when to stop; in the limit, experimenters can use data monitoring to guarantee that the null hypothesis will be rejected. Bayesian hypothesis testing (BHT) provides a solution to these ills because the stopping rule used is irrelevant to the calculation of a Bayes factor. In addition, there are strong mathematical guarantees on the frequentist properties of BHT that are comforting for researchers concerned that stopping rules could influence the Bayes factors produced. Here, we show that these guaranteed bounds have limited scope and often do not apply in psychological research. Specifically, we quantitatively demonstrate the impact of optional stopping on the resulting Bayes factors in two common situations: (1) when the truth is a combination of the hypotheses, such as in a heterogeneous population, and (2) when a hypothesis is composite (taking multiple parameter values), such as the alternative hypothesis in a t-test. We found that, for these situations, while the Bayesian interpretation remains correct regardless of the stopping rule used, the choice of stopping rule can, in some situations, greatly increase the chance of experimenters finding evidence in the direction they desire. We suggest ways to control these frequentist implications of stopping rules on BHT.
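The claim that data monitoring inflates NHST rejection rates can be illustrated with a small simulation. The following is a minimal sketch and not the authors' analysis: it assumes a two-sided z-test with known unit variance, a true null (mean 0), and a hypothetical experimenter who tests after every observation from `min_n` onward and stops at the first p < `alpha`. The function name `simulate` and all parameter values are illustrative assumptions.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def simulate(n_sims=2000, min_n=10, max_n=100, alpha=0.05, seed=0):
    """Compare false-positive rates when the null hypothesis is true.

    Returns (fixed_rate, optional_rate):
      fixed_rate    - test once, at the pre-planned sample size max_n
      optional_rate - test after every observation from min_n to max_n
                      and count a rejection if any look gives p < alpha
    Both rates are computed on the same simulated data sets.
    """
    rng = random.Random(seed)
    fixed_rejections = 0
    optional_rejections = 0
    for _ in range(n_sims):
        total = 0.0
        rejected_at_some_look = False
        for n in range(1, max_n + 1):
            total += rng.gauss(0.0, 1.0)  # data drawn under the null
            if n >= min_n and not rejected_at_some_look:
                z = total / math.sqrt(n)  # sample mean * sqrt(n), sigma = 1
                if two_sided_p(z) < alpha:
                    rejected_at_some_look = True
        # Fixed-n analysis: a single test at the final sample size.
        if two_sided_p(total / math.sqrt(max_n)) < alpha:
            fixed_rejections += 1
        if rejected_at_some_look:
            optional_rejections += 1
    return fixed_rejections / n_sims, optional_rejections / n_sims


if __name__ == "__main__":
    fixed_rate, optional_rate = simulate()
    print(f"fixed-n false-positive rate:           {fixed_rate:.3f}")
    print(f"optional-stopping false-positive rate: {optional_rate:.3f}")
```

Because the final look at `max_n` is one of the monitored looks, the optional-stopping rate in this sketch can never fall below the fixed-n rate; raising `max_n` gives the monitoring experimenter more chances to cross the threshold.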