Brookes S T, Whitley E, Peters T J, Mulheran P A, Egger M, Davey Smith G
Department of Social Medicine, University of Bristol, UK.
Health Technol Assess. 2001;5(33):1-56. doi: 10.3310/hta5330.
Subgroup analyses are common in randomised controlled trials (RCTs). There are many easily accessible guidelines on the selection and analysis of subgroups but the key messages do not seem to be universally accepted and inappropriate analyses continue to appear in the literature. This has potentially serious implications because erroneous identification of differential subgroup effects may lead to inappropriate provision or withholding of treatment.
(1) To quantify the extent to which subgroup analyses may be misleading. (2) To compare the relative merits and weaknesses of the two most common approaches to subgroup analysis: separate (subgroup-specific) analyses of treatment effect and formal statistical tests of interaction. (3) To establish what factors affect the performance of the two approaches. (4) To provide estimates of the increase in sample size required to detect differential subgroup effects. (5) To provide recommendations on the analysis and interpretation of subgroup analyses.
The performances of subgroup-specific and formal interaction tests were assessed by simulating data with no differential subgroup effects and determining the extent to which the two approaches (incorrectly) identified such an effect, and simulating data with a differential subgroup effect and determining the extent to which the two approaches were able to (correctly) identify it. Initially, data were simulated to represent the 'simplest case' of two equal-sized treatment groups and two equal-sized subgroups. Data were first simulated with no differential subgroup effect and then with a range of types and magnitudes of subgroup effect with the sample size determined by the nominal power (50-95%) for the overall treatment effect. Additional simulations were conducted to explore the individual impact of the sample size, the magnitude of the overall treatment effect, the size and number of treatment groups and subgroups and, in the case of continuous data, the variability of the data. The simulated data covered the types of outcomes most commonly used in RCTs, namely continuous (Gaussian) variables, binary outcomes and survival times. All analyses were carried out using appropriate regression models, and subgroup effects were identified on the basis of statistical significance at the 5% level.
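The simplest-case design described above can be sketched in a few lines. This is an illustration only, not the authors' code: it assumes a continuous Gaussian outcome with unit standard deviation, four equal cells (two treatment groups by two subgroups), and large-sample z-tests on differences in means in place of the regression models used in the report. The function name `simulate`, its parameters and the "significant in one subgroup only" summary are hypothetical choices made for this sketch.

```python
import numpy as np

Z_CRIT = 1.959964  # two-sided 5% critical value of the standard normal


def simulate(n_total=400, effect=0.0, interaction=0.0, n_sims=4000, seed=0):
    """Simulate a 2 x 2 trial (treatment x subgroup) with equal cell sizes.

    Assumptions of this sketch (not taken from the report): Gaussian outcome,
    SD = 1 in every cell, z-tests on mean differences instead of regression.
    Returns the proportion of simulated trials in which
      - the interaction test is significant at the 5% level, and
      - the treatment effect is significant in exactly one subgroup.
    """
    rng = np.random.default_rng(seed)
    n = n_total // 4  # patients per (treatment, subgroup) cell
    # Subgroup-specific treatment effects differ by `interaction`
    # and average to the overall `effect`.
    sub_effects = (effect - interaction / 2, effect + interaction / 2)

    diffs, variances = [], []
    for sub_effect in sub_effects:
        control = rng.normal(0.0, 1.0, size=(n_sims, n))
        treated = rng.normal(sub_effect, 1.0, size=(n_sims, n))
        # Estimated treatment effect within this subgroup, one per trial
        diffs.append(treated.mean(axis=1) - control.mean(axis=1))
        # Estimated variance of that treatment effect
        variances.append(
            (treated.var(axis=1, ddof=1) + control.var(axis=1, ddof=1)) / n
        )

    # Subgroup-specific tests: is the effect significant within each subgroup?
    sig = [np.abs(d) / np.sqrt(v) > Z_CRIT for d, v in zip(diffs, variances)]
    # Interaction test: do the two subgroup effects differ from each other?
    z_int = np.abs(diffs[1] - diffs[0]) / np.sqrt(variances[0] + variances[1])

    return {
        "interaction": float(np.mean(z_int > Z_CRIT)),
        "one_subgroup_only": float(np.mean(sig[0] ^ sig[1])),
    }


if __name__ == "__main__":
    # No differential subgroup effect, with and without an overall effect.
    print(simulate(effect=0.0))
    print(simulate(effect=0.28))  # roughly 80% power overall at n_total=400
```

With no simulated interaction, the interaction test in this sketch should reject close to the nominal 5% of the time whatever the overall effect, whereas the "one subgroup only" pattern becomes much more frequent once an overall treatment effect is present. The exact proportions depend on the assumptions listed above and are not expected to reproduce the report's figures.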
While there was some variation for smaller sample sizes, the results for the three types of outcome were very similar for simulations with a total sample size of 200 or more. With simulated simplest case data with no differential subgroup effects, the formal tests of interaction were significant in 5% of cases as expected, while subgroup-specific tests were less reliable and identified effects in 7-66% of cases depending on whether there was an overall treatment effect. The most common type of subgroup effect identified in this way was where the treatment effect was seen to be significant in one subgroup only. When a simulated differential subgroup effect was included, the results were dependent on the nominal power of the simulated data and the type and magnitude of the subgroup effect. However, the performance of the formal interaction test was generally superior to that of the subgroup-specific analyses, with more differential effects correctly identified. In addition, the subgroup-specific analyses often suggested the wrong type of differential effect. The ability of formal interaction tests to (correctly) identify subgroup effects improved as the size of the interaction increased relative to the overall treatment effect. When the size of the interaction was twice the overall effect or greater, the interaction tests had at least the same power as the test of the overall treatment effect. However, power was considerably reduced for smaller interactions, which are much more likely in practice. The inflation factor by which the sample size must be increased to detect the interaction with the same power as the overall effect varied with the size of the interaction. For an interaction of the same magnitude as the overall effect, the inflation factor was 4, and this increased dramatically to 100 or more for more subtle interactions of less than 20% of the overall effect.
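The inflation factors quoted above are consistent with a simple back-of-envelope calculation, sketched here under assumptions that go beyond the abstract: a balanced design with two equal treatment groups and two equal subgroups, a Gaussian outcome with common variance sigma^2, and total sample size N. The overall treatment effect is then estimated with variance 4*sigma^2/N, while the interaction (the difference between the two subgroup-specific treatment effects) is estimated with variance 16*sigma^2/N, four times larger. Matching the power of the overall test for an interaction that is `ratio` times the overall effect therefore needs the sample size multiplied by 4 / ratio^2. The function name `inflation_factor` is hypothetical and the report's own estimates came from simulation, not from this formula.

```python
def inflation_factor(ratio):
    """Approximate sample-size multiplier needed to detect an interaction
    with the same power as the overall treatment effect.

    `ratio` is the size of the interaction divided by the size of the
    overall effect. Assumes a balanced 2 x 2 design and a Gaussian outcome
    with equal variance in every cell (assumptions of this sketch).
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    # Var(interaction) = 4 * Var(overall effect), and the required sample
    # size scales with variance / effect**2.
    return 4.0 / ratio ** 2


if __name__ == "__main__":
    for r in (2.0, 1.0, 0.5, 0.2):
        print(f"interaction = {r:g} x overall effect -> inflate N by {inflation_factor(r):g}")
```

Under these assumptions the multiplier is 1 when the interaction is twice the overall effect, 4 when the two are equal (the 'rule of four'), and about 100 when the interaction is 20% of the overall effect, in line with the results reported above.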
Formal interaction tests were generally robust to alterations in the number and size of the treatment groups and subgroups and, for continuous data, to the variance in the treatment groups, the only exception being a change in the variance in one of the subgroups. In contrast, the performance of the subgroup-specific tests was affected by almost all of these factors; only a change in the number of treatment groups had no impact at all.
While it is generally recognised that subgroup analyses can produce spurious results, the extent of the problem is almost certainly under-estimated. This is particularly true when subgroup-specific analyses are used. In addition, the increase in sample size required to identify differential subgroup effects may be substantial and the commonly used 'rule of four' may not always be sufficient, especially when interactions are relatively subtle, as is often the case.
CONCLUSIONS--RECOMMENDATIONS FOR SUBGROUP ANALYSES AND THEIR INTERPRETATION: (1) Subgroup analyses should, as far as possible, be restricted to those proposed before data collection. Any subgroups chosen after this time should be clearly identified. (2) Trials should ideally be powered with subgroup analyses in mind. However, for modest interactions, this may not be feasible. (3) Subgroup-specific analyses are particularly unreliable and are affected by many factors. Subgroup analyses should always be based on formal tests of interaction, although even these should be interpreted with caution. (4) The results from any subgroup analyses should not be over-interpreted. Unless there is strong supporting evidence, they are best viewed as a hypothesis-generation exercise. In particular, one should be wary of evidence suggesting that treatment is effective in one subgroup only. (5) Any apparent lack of differential effect should be regarded with caution unless the study was specifically powered with interactions in mind.
CONCLUSIONS--RECOMMENDATIONS FOR RESEARCH: (1) The implications of basing conclusions on confidence intervals rather than p-values could be explored. (2) The same approach as in this study could be applied to contexts other than RCTs, such as observational studies and meta-analyses. (3) The scenarios used in this study could be examined more comprehensively using other statistical methods, incorporating clustering effects, considering other types of outcome variable and using other approaches, such as bootstrapping or Bayesian methods.