Medical College of Wisconsin, Milwaukee, WI, USA.
Bone Marrow Transplant. 2023 May;58(5):474-477. doi: 10.1038/s41409-023-01935-3. Epub 2023 Mar 3.
We are pleased to add this typescript, Inappropriate use of statistical power by Raphael Fraser, to the BONE MARROW TRANSPLANTATION Statistics Series. The author discusses how we sometimes misuse statistical analyses after a study is completed and analyzed in order to explain the results. The most egregious example is the post hoc power calculation. When the conclusion of an observational study or clinical trial is negative, namely, the observed data (or more extreme data) fail to reject the null hypothesis, people often argue for calculating the observed statistical power. This is especially true of clinical trialists who believe in a new therapy and hoped for a favorable outcome (rejecting the null hypothesis). One is reminded of the saying attributed to Benjamin Franklin: a man convinced against his will is of the same opinion still.

As the author notes, when we face a negative conclusion of a clinical trial there are two possibilities: (1) there is no treatment effect; or (2) we made a mistake. By calculating the observed power after the study, people incorrectly believe that if the observed power is high there is strong support for the null hypothesis. The problem is usually the opposite: if the observed power is low, the null hypothesis was not rejected because there were too few subjects. This is usually couched in terms such as there was a trend towards… or we failed to detect a benefit because we had too few subjects, or the like. Observed power should not be used to interpret the results of a negative study. Put more strongly, observed power should not be calculated after a study is completed and analyzed: the power of the study to reject or not reject the null hypothesis is already incorporated in the calculation of the p-value.

The author uses interesting analogies to make important points about hypothesis testing. Testing the null hypothesis is like a jury trial. The jury can find the defendant guilty or not guilty; it cannot find him innocent.
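The futility of post hoc power can be made concrete with a toy calculation (not from the typescript itself). The sketch below, in Python using only the standard library, defines a hypothetical helper `observed_power` for a two-sided z-test under the simplifying one-tail approximation commonly used for this argument: because the observed effect is plugged back in as if it were the true effect, observed power is a deterministic function of the p-value and adds no new information.

```python
from statistics import NormalDist

def observed_power(p_value, alpha=0.05):
    """Post hoc 'observed power' for a two-sided z-test (toy sketch).

    The observed effect is treated as if it were the true effect, so the
    result depends only on the p-value and alpha -- it carries no
    information beyond the p-value itself.
    """
    nd = NormalDist()
    z_obs = nd.inv_cdf(1 - p_value / 2)   # |z| implied by the p-value
    z_crit = nd.inv_cdf(1 - alpha / 2)    # critical value for alpha
    return 1 - nd.cdf(z_crit - z_obs)     # P(reject | true effect = observed)

# A result exactly at the significance threshold (p = 0.05) always has
# observed power 0.5; less significant results give even less.
print(round(observed_power(0.05), 2))  # -> 0.5
```

A study that just misses significance therefore always looks "underpowered" by this measure, which is precisely why quoting observed power for a negative study is circular.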
It is always important to recall that failure to reject the null hypothesis does not mean the null hypothesis is true, simply that there is insufficient evidence (data) to reject it. As the author notes: in a sense, hypothesis testing is like world championship boxing, where the null hypothesis is the champion until defeated by the challenger, the alternative hypothesis, which then becomes the new world champion.

The author includes a discussion of what a p-value is, a topic we have discussed before in this series and elsewhere [1, 2]. Finally, there is a nice discussion of confidence intervals (frequentist) and credibility limits (Bayesian). A frequentist interpretation views probability as the limit of the relative frequency of an event over many trials. In contrast, a Bayesian interpretation views probability as a degree of belief in an event. This belief could be based on prior knowledge such as the results of previous trials, biological plausibility, or personal conviction (my drug is better than your drug). The important point is the common misinterpretation of confidence intervals. For example, many researchers interpret a 95 percent confidence interval to mean there is a 95 percent chance that the interval contains the parameter value. This is wrong. It means that if we repeated the identical study many times, 95 percent of the resulting intervals would contain the true but unknown parameter in the population. This will seem strange to many people because we are interested only in the study we are analyzing, not in repeating the same study design many times.

We hope readers will enjoy this well-written summary of common statistical errors, especially post hoc calculations of observed power. Going forth we hope to ban statements like there was a trend towards… or we failed to detect a benefit because we had too few subjects from the Journal. Reviewers have been advised. Proceed at your own risk.
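The "repeated studies" reading of a confidence interval can be checked by simulation. The following sketch (our illustration, not part of the typescript, with an arbitrary true mean, standard deviation, and sample size) repeats the identical study design many times and counts how often the 95 percent interval captures the true parameter.

```python
import random
from statistics import NormalDist

random.seed(1)
z = NormalDist().inv_cdf(0.975)        # ~1.96 for a 95% interval
true_mean, sigma, n = 10.0, 2.0, 25    # assumed population and design
half_width = z * sigma / n ** 0.5      # known-sigma z-interval half-width

trials = 10_000
covered = 0
for _ in range(trials):
    # One hypothetical replicate of the identical study design
    xbar = sum(random.gauss(true_mean, sigma) for _ in range(n)) / n
    if xbar - half_width <= true_mean <= xbar + half_width:
        covered += 1

# The proportion is close to 0.95: the 95% describes the procedure
# across replications, not the chance for any single computed interval.
print(covered / trials)
```

The 95 percent is a property of the interval-generating procedure; any one computed interval either contains the true parameter or it does not.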
Robert Peter Gale MD, PhD, DSc(hc), FACP, FRCP, FRCPI(hon), FRSM, Imperial College London, Mei-Jie Zhang PhD, Medical College of Wisconsin.