Institut für Klinische Epidemiologie, Medizinische Fakultät, Martin-Luther-Universität Halle-Wittenberg, Magdeburger Str. 8, 06097, Halle (Saale), Germany.
Eur J Epidemiol. 2010 Apr;25(4):225-30. doi: 10.1007/s10654-010-9440-x. Epub 2010 Mar 26.
Since its introduction into the biomedical literature, statistical significance testing (SST) has caused much debate. The aim of this perspective article is to review frequent fallacies and misuses of SST in the biomedical field and to review a potential way out of them. Two frequentist schools of statistical inference merged to form SST as it is practised today: the Fisher school and the Neyman-Pearson school. The P-value is both reported quantitatively and checked against the alpha level to produce a qualitative, dichotomous measure (significant/nonsignificant). However, a P-value mixes the estimated effect size with its estimated precision, and it is not possible to measure these two things with a single number. For the valid interpretation of SSTs, a variety of presumptions and requirements have to be met. We point here to four of them: study size, a correct statistical model, a correct causal model, and the absence of bias and confounding. It has been stated that the P-value is perhaps the most misunderstood statistical concept in clinical research. As in the social sciences, the tyranny of SST is still highly prevalent in the biomedical literature, even after decades of warnings against SST. The ubiquitous misuse and tyranny of SST threaten scientific discoveries and may even impede scientific progress. In the worst case, misuse of significance testing may even harm patients who are eventually treated incorrectly because of improper handling of P-values. For a proper interpretation of study results, both the estimated effect size and its estimated precision are necessary ingredients.
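The point that a P-value mixes effect size with precision can be illustrated with a minimal sketch. It is not taken from the article: the two studies, their estimates, and their standard errors are hypothetical numbers, and a simple two-sided z-test under a normal approximation is assumed. Both studies give the same P-value, although one estimates a tenfold larger effect with tenfold less precision.

```python
import math

def two_sided_p(estimate, std_error):
    """Two-sided P-value from a z-test (normal approximation assumed)."""
    z = estimate / std_error
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

def ci95(estimate, std_error):
    """Approximate 95% confidence interval under the same normal assumption."""
    half_width = 1.959964 * std_error
    return (estimate - half_width, estimate + half_width)

# Hypothetical studies, invented for illustration only.
small_study = {"estimate": 5.0, "std_error": 2.5}   # large effect, imprecise
large_study = {"estimate": 0.5, "std_error": 0.25}  # small effect, precise

p_small = two_sided_p(**small_study)
p_large = two_sided_p(**large_study)
ci_small = ci95(**small_study)
ci_large = ci95(**large_study)

print(f"small study: P = {p_small:.4f}, 95% CI = ({ci_small[0]:.2f}, {ci_small[1]:.2f})")
print(f"large study: P = {p_large:.4f}, 95% CI = ({ci_large[0]:.2f}, {ci_large[1]:.2f})")
```

The P-values coincide because both estimates are exactly two standard errors from zero. Only the point estimate together with its confidence interval shows how different the two results are.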