NIWA (National Institute of Water & Atmospheric Research), PO Box 11115, Hamilton, 3216, New Zealand.
Environ Monit Assess. 2014 May;186(5):2729-40. doi: 10.1007/s10661-013-3574-8. Epub 2013 Dec 20.
Interpreting a P value from a traditional nil hypothesis test as the strength of evidence for an environmentally important difference between two populations of a continuous variable (e.g. a chemical concentration) has become commonplace. Yet there is a substantial literature, in many disciplines, that faults this practice. In particular, the hypothesis tested is virtually guaranteed to be false, with the result that P depends far too heavily on the number of samples collected (the 'sample size'). The end result is a swinging burden of proof (permissive at low sample size but precautionary at large sample size). We propose that these tests be reinterpreted as direction detectors (as has been proposed by others, starting from 1960) and that the test procedure be performed simultaneously with two types of equivalence tests (one testing that the difference that does exist is contained within an interval of indifference, the other testing that it is beyond that interval, also known as bioequivalence testing). This gives rise to a strength-of-evidence procedure that lends itself to a simple confidence interval interpretation. It is accompanied by a strength-of-evidence matrix that has many desirable features: not only a strong/moderate/dubious/weak categorisation of the results, but also recommendations about the desirability of collecting further data to strengthen findings.
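The confidence interval reading described above can be illustrated with a minimal sketch. It is not the authors' procedure: the function names, the large-sample normal approximation, the single shared confidence level and the symmetric indifference interval (-delta, +delta) are all assumptions made for illustration. The strength-of-evidence matrix is not reproduced, because the abstract does not give its rules.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def difference_interval(a, b, level=0.90):
    """Confidence interval for mean(a) - mean(b).

    Assumption: a large-sample normal approximation with unpooled
    variances; the abstract does not say how the interval is built.
    """
    diff = mean(a) - mean(b)
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff - z * se, diff + z * se


def interpret(lower, upper, delta):
    """Read three questions off one interval (hypothetical helper).

    direction: does the interval exclude zero, and on which side?
    within_indifference: is the interval wholly inside (-delta, +delta)?
    beyond_indifference: is the interval wholly outside that range?
    """
    if lower > 0:
        direction = "positive"
    elif upper < 0:
        direction = "negative"
    else:
        direction = "undetermined"
    return {
        "direction": direction,
        "within_indifference": -delta < lower and upper < delta,
        "beyond_indifference": lower > delta or upper < -delta,
    }
```

For example, `interpret(*difference_interval(site_a, site_b), delta=1.0)` would report the direction of the difference and whether the interval sits inside or outside the indifference range, where `site_a`, `site_b` and the value of `delta` are placeholders the analyst would have to supply.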