Guyatt G, Jaeschke R, Heddle N, Cook D, Shannon H, Walter S
Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ont.
CMAJ. 1995 Jan 1;152(1):27-32.
In the first of a series of four articles, the authors explain the statistical concepts of hypothesis testing and p values. In many clinical trials investigators test a null hypothesis that there is no difference between a new treatment and a placebo or between two treatments. The result of a single experiment will almost always show some difference between the experimental and the control groups. Is the difference due to chance, or is it large enough to reject the null hypothesis and conclude that there is a true difference in treatment effects? Statistical tests yield a p value: the probability that the experiment would show a difference as great as or greater than that observed if the null hypothesis were true. By convention, p values of less than 0.05 are considered statistically significant, and investigators conclude that there is a real difference. However, the smaller the sample size, the greater the chance of erroneously concluding that the experimental treatment does not differ from the control; in statistical terms, the power of the test may be inadequate. Tests of several outcomes from one set of data may lead to an erroneous conclusion that an outcome is significant if the joint probability of the outcomes is not taken into account. Hypothesis testing has limitations, which will be discussed in the next article in the series.
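The three ideas in the abstract (p value, power, and multiple outcomes) can be illustrated with a small sketch. It is not taken from the article. It assumes a hypothetical crossover-style trial in which each patient states a preference for the new treatment or the control, so that under the null hypothesis each preference is a fair coin flip (an exact sign test). The function names, the true preference rate of 0.7, and the assumption that the multiple outcomes are independent are all illustrative choices.

```python
from math import comb


def sign_test_p_value(prefer_new: int, n: int) -> float:
    """Two-sided exact p value under the null hypothesis of no preference.

    Sums the null probabilities of every result at least as far from n/2
    as the observed count (the "as great as or greater" difference).
    """
    observed_dev = abs(prefer_new - n / 2)
    extreme = sum(comb(n, k) for k in range(n + 1)
                  if abs(k - n / 2) >= observed_dev)
    return extreme / 2 ** n


def power(n: int, true_rate: float, alpha: float = 0.05) -> float:
    """Chance of obtaining p < alpha when the true preference rate is true_rate.

    true_rate is a hypothetical effect size chosen for illustration.
    """
    return sum(comb(n, k) * true_rate ** k * (1 - true_rate) ** (n - k)
               for k in range(n + 1)
               if sign_test_p_value(k, n) < alpha)


def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one p < alpha among num_tests outcomes when every
    null hypothesis is true, assuming the outcomes are independent."""
    return 1 - (1 - alpha) ** num_tests


if __name__ == "__main__":
    # 8 of 10 patients prefer the new treatment: is that beyond chance?
    print(f"p value, 8 of 10: {sign_test_p_value(8, 10):.4f}")
    # Same hypothetical true effect, different sample sizes.
    for n in (10, 50):
        print(f"power with n={n}: {power(n, 0.7):.2f}")
    # Testing several outcomes inflates the chance of a false positive.
    for k in (1, 6):
        print(f"{k} outcome(s): {familywise_error_rate(k):.3f}")
```

Under these assumptions, 8 preferences out of 10 gives a p value above 0.05, so the null hypothesis would not be rejected even though the observed split looks lopsided. The power calculation shows why: with only 10 patients the test rarely reaches significance even when a real preference exists, whereas a larger sample detects it far more often. The last function shows how the chance of at least one spuriously significant result grows as more outcomes are tested at the 0.05 level.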