Brookes Sara T, Whitley Elise, Egger Matthias, Smith George Davey, Mulheran Paul A, Peters Tim J
Department of Social Medicine, University of Bristol, Whiteladies Road, Bristol, BS8 2PR, UK.
J Clin Epidemiol. 2004 Mar;57(3):229-36. doi: 10.1016/j.jclinepi.2003.08.009.
Despite guidelines recommending the use of formal tests of interaction in subgroup analyses in clinical trials, inappropriate subgroup-specific analyses continue. Moreover, trials designed to detect overall treatment effects have limited power to detect treatment-subgroup interactions. This article quantifies the error rates associated with subgroup analyses.
Simulations quantified the risks of misinterpreting subgroup analyses as evidence of differential subgroup effects and the limited power of the interaction test in trials designed to detect overall treatment effects.
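As an illustration only, and not the authors' simulation code, the kind of simulation described above can be sketched under simplifying assumptions that are not taken from the abstract: a continuous outcome analysed with a normal approximation, two equal-sized subgroups, a true treatment effect that is identical in both subgroups (no interaction), and a trial sized for 80% power on the overall effect. The function name `simulate` and its parameters are hypothetical.

```python
import random
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def simulate(overall_power=0.80, alpha=0.05, n_sims=20000, seed=1):
    """Return (share of trials significant in exactly one subgroup,
    share of trials with a significant interaction test).

    Effects are expressed in units of the standard error of the
    overall treatment effect, so that standard error equals 1.
    """
    rng = random.Random(seed)
    z_crit = Z.inv_cdf(1 - alpha / 2)
    # Standardised true effect giving (approximately) the requested overall power.
    delta = z_crit + Z.inv_cdf(overall_power)
    se_sub = 2 ** 0.5  # each subgroup holds half the sample
    se_int = 2.0       # difference of two independent subgroup estimates

    one_only = 0
    interaction_sig = 0
    for _ in range(n_sims):
        # Same true effect in both subgroups: any apparent difference is noise.
        est_a = rng.gauss(delta, se_sub)
        est_b = rng.gauss(delta, se_sub)
        sig_a = abs(est_a / se_sub) > z_crit
        sig_b = abs(est_b / se_sub) > z_crit
        one_only += sig_a != sig_b
        interaction_sig += abs((est_a - est_b) / se_int) > z_crit
    return one_only / n_sims, interaction_sig / n_sims


if __name__ == "__main__":
    one_only_rate, interaction_rate = simulate()
    print(f"significant in one subgroup only: {one_only_rate:.3f}")
    print(f"significant interaction test:     {interaction_rate:.3f}")
```

Under these assumptions, roughly half of the simulated trials show a significant effect in one subgroup only even though no true difference exists, while the interaction test is significant at about the nominal 5% rate. Other trial characteristics (unequal subgroup sizes, different overall power) would shift the first figure.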
Although formal interaction tests performed as expected with respect to false positives, subgroup-specific tests were considerably less reliable: a significant effect in one subgroup only was observed in 7% to 64% of simulations, depending on trial characteristics. Regarding power of the interaction test, a trial with 80% power for the overall effect had only 29% power to detect an interaction effect of the same magnitude. For interactions of this size to be detected with the same power as the overall effect, sample sizes should be inflated fourfold; the required inflation increases dramatically for interactions smaller than 20% of the overall effect.
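The 29% and fourfold figures can be reproduced with a short calculation. This is a sketch under assumptions not stated in the abstract: a continuous outcome, a normal approximation, and two equal-sized subgroups, so that the interaction estimate has four times the variance (twice the standard error) of the overall effect estimate. The function names `interaction_power` and `inflation_factor` are hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def interaction_power(overall_power=0.80, ratio=1.0, alpha=0.05):
    """Approximate power of the interaction test in a trial sized for the
    overall effect. `ratio` is the interaction size divided by the
    overall effect size.
    """
    z_crit = Z.inv_cdf(1 - alpha / 2)
    # Overall effect in units of its own standard error.
    delta = z_crit + Z.inv_cdf(overall_power)
    # The interaction estimate has twice the standard error (assumption:
    # two equal-sized subgroups), so its standardised size is halved.
    shift = ratio * delta / 2.0
    return Z.cdf(shift - z_crit) + Z.cdf(-shift - z_crit)


def inflation_factor(ratio=1.0):
    """Sample size multiplier needed for the interaction test to match the
    power of the overall test, under the same assumptions."""
    return 4.0 / ratio ** 2


if __name__ == "__main__":
    print(f"interaction power: {interaction_power():.2f}")
    for r in (1.0, 0.5, 0.2):
        print(f"ratio {r}: inflate sample size x{inflation_factor(r):.0f}")
```

With the defaults, the calculated power is about 0.29 and the inflation factor is 4, matching the reported results. Because the factor scales with the inverse square of the ratio in this sketch, an interaction one fifth the size of the overall effect would need roughly a hundredfold larger sample.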
Although it is generally recognized that subgroup analyses can produce spurious results, the extent of the problem may be underestimated.