Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA, USA.
Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA.
Clin Trials. 2020 Dec;17(6):597-606. doi: 10.1177/1740774520940256. Epub 2020 Sep 15.
More than 95% of recent cancer randomized controlled trials used the log-rank test to detect a treatment difference, making it the predominant tool for comparing two survival functions. As with other tests, the log-rank test has both advantages and disadvantages. One advantage is that it offers the highest power against proportional hazards differences, which may be a major reason why alternative methods have rarely been employed in practice. The performance of statistical tests has traditionally been investigated both theoretically and numerically for several patterns of difference between two survival functions. However, to the best of our knowledge, there has been no attempt to compare the performance of various statistical tests using empirical data from past oncology randomized controlled trials. It is therefore unknown whether the log-rank test offers a meaningful power advantage over alternative testing methods in contemporary cancer randomized controlled trials. Focusing on recently reported phase III cancer randomized controlled trials, we assessed whether the log-rank test gave meaningfully greater power when compared with five alternative testing methods: the generalized Wilcoxon test, a test based on the maximum of test statistics from multiple weighted log-rank tests, the difference in t-year event rate, and the difference in restricted mean survival time with fixed and adaptive truncation time τ.
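For readers less familiar with the reference method, the following is a minimal sketch of a standard two-sample log-rank test in plain Python. It is not the authors' code; the function name `logrank_test`, its argument layout, and the toy data in the usage note are assumptions made for illustration only. The p value uses the chi-square approximation with one degree of freedom.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-sample log-rank test (illustrative sketch, hypothetical helper).

    times_*  : follow-up times for each patient in the arm
    events_* : 1 if the event was observed at that time, 0 if censored
    Returns (chi-square statistic, two-sided p value).
    """
    # Pool both arms; the third field marks the arm (0 = A, 1 = B).
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]

    # The statistic is accumulated over the distinct observed event times.
    event_times = sorted({t for t, e, _ in data if e})

    obs_minus_exp = 0.0  # sum of (observed - expected) events in arm A
    variance = 0.0       # sum of hypergeometric variances

    for t in event_times:
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)  # at risk, arm A
        n_b = sum(1 for u, _, g in data if u >= t and g == 1)  # at risk, arm B
        d_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        d_b = sum(1 for u, e, g in data if u == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b

        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0:
        raise ValueError("log-rank variance is zero; the test is undefined")

    chi2 = obs_minus_exp ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

As a usage example on made-up data, `logrank_test([1, 2, 3], [1, 1, 1], [4, 5, 6], [1, 1, 1])` compares an arm in which all events occur early against an arm in which all events occur late. The alternative tests named above replace this statistic with a differently weighted or differently defined contrast between the two survival curves.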
Using manuscripts from cancer randomized controlled trials recently published in high-tier clinical journals, we reconstructed patient-level data for overall survival (69 trials) and progression-free survival (54 trials). For each trial endpoint, we estimated the empirical power of each test. Empirical power was measured as the proportion of trials for which a test would have identified a significant result (p value < .05).
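The definition of empirical power above reduces to a simple proportion, which the following minimal sketch makes explicit. The function name `empirical_power` and the p values in the usage note are hypothetical and are not taken from the study.

```python
def empirical_power(p_values, alpha=0.05):
    """Proportion of trials in which a test gave a significant result.

    p_values : one p value per trial, all from the same statistical test
               applied to the same endpoint (hypothetical input)
    alpha    : significance threshold; .05 matches the definition above
    """
    if not p_values:
        raise ValueError("at least one trial is required")
    significant = sum(1 for p in p_values if p < alpha)
    return significant / len(p_values)
```

For instance, `empirical_power([0.01, 0.20, 0.04, 0.50])` counts two significant results among four made-up trials. In the setting described above, the same calculation would be repeated for each test and each endpoint, using the p values obtained from the reconstructed patient-level data.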
For overall survival, the difference in t-year event rate offered the lowest empirical power (30.4%) and the difference in restricted mean survival time with fixed truncation time τ offered the highest (43.5%). The empirical power of the other tests was almost identical (36.2%-37.7%). For progression-free survival, the tests we investigated offered numerically similar empirical power (55.6%-61.1%). No single test consistently outperformed any other test.
Assessing empirical power with past cancer randomized controlled trials provided new insights into the performance of statistical tests. Although the log-rank test has been used in almost all trials, our study suggests that it is not the only option from an empirical power perspective. Near-universal use of the log-rank test is not supported by a meaningful difference in empirical power. Clinical trial investigators could consider alternative methods, beyond the log-rank test, for their primary analysis when designing a cancer randomized controlled trial. Factors other than power (e.g. interpretability of the estimated treatment effect) should garner greater consideration when selecting statistical tests for cancer randomized controlled trials.