Domenech Raúl J
Instituto de Ciencias Biomédicas, Facultad de Medicina, Universidad de Chile, Santiago, Chile.
Rev Med Chil. 2018 Dec;146(10):1184-1189. doi: 10.4067/S0034-98872018001001184.
Statistical inference was introduced by Fisher and by Neyman and Pearson more than 90 years ago to assess whether a difference in results between groups is due to chance or is a real, "significant" difference. The usual procedure is to calculate the probability (P) of obtaining results at least as extreme as those observed under the null hypothesis, which states that there is no real difference beyond the inevitable sampling variability. If this probability is high, we accept the null hypothesis and infer that there is no real difference; if P is low (P < 0.05), we reject the null hypothesis and infer that there is a "significant" difference. However, a large number of discoveries made with this method are not reproducible. Statisticians have described the deficiencies of the method and warned researchers that P is a very unreliable measure. This review describes two uncertainties of the "significance" concept: a) the inadequacy of a P value as grounds for rejecting the null hypothesis; b) the low probability of reproducing a P value after an exact replication of the experiment. Because "significance" has been discredited, the American Statistical Association recently stated that P values do not provide a good measure of evidence for a hypothesis. Statisticians recommend never using the word "significant" because it is misleading. Instead, the exact P value should be reported along with the effect size and confidence intervals. No P value greater than 0.001 should be considered a demonstration that something has been discovered. Several alternatives to the classical concepts are currently under study.
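Point b), the low probability of reproducing a P value after an exact replication, can be illustrated with a small simulation. The following is a minimal sketch and is not taken from the article: it assumes two groups of normally distributed data with a known standard deviation of 1, so that a simple z-test applies, and the true difference (0.5), the group size (32) and the number of replications (2000) are hypothetical values chosen only for illustration.

```python
import math
import random
import statistics


def two_sided_p_from_z(z):
    """Two-sided P value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_p_values(true_diff, n_per_group, n_replications, seed=0):
    """P values from repeated, identical two-group experiments.

    Assumption (not from the article): both groups are normal with a
    known standard deviation of 1, so a z-test on the difference of
    means is valid. All parameter values are hypothetical.
    """
    rng = random.Random(seed)
    # Standard error of the difference between two means, sigma = 1.
    standard_error = math.sqrt(2.0 / n_per_group)
    p_values = []
    for _ in range(n_replications):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(true_diff, 1.0) for _ in range(n_per_group)]
        z = (statistics.fmean(treated) - statistics.fmean(control)) / standard_error
        p_values.append(two_sided_p_from_z(z))
    return p_values


if __name__ == "__main__":
    p_values = simulate_p_values(true_diff=0.5, n_per_group=32, n_replications=2000)
    p_sorted = sorted(p_values)
    share_below_005 = sum(p < 0.05 for p in p_values) / len(p_values)
    print(f"share of replications with P < 0.05: {share_below_005:.2f}")
    print(f"5th percentile of P:  {p_sorted[len(p_sorted) // 20]:.4f}")
    print(f"median P:             {statistics.median(p_values):.4f}")
    print(f"95th percentile of P: {p_sorted[-len(p_sorted) // 20]:.4f}")
```

With these hypothetical settings the theoretical power of the test is about 0.5, so even though the difference is real in every simulated experiment, roughly half of the exact replications would be expected to fall on each side of the 0.05 threshold, and the individual P values spread over a wide range.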