MRC Clinical Trials Unit at UCL, Institute of Clinical Trials and Methodology, 90 High Holborn, London, WC1V 6LJ, UK.
Trials. 2020 Apr 6;21(1):315. doi: 10.1186/s13063-020-4153-2.
The logrank test is routinely applied to design and analyse randomized controlled trials (RCTs) with time-to-event outcomes. Sample size and power calculations assume the treatment effect follows proportional hazards (PH). If the PH assumption is false, power is reduced and interpretation of the hazard ratio (HR) as the estimated treatment effect is compromised. Using statistical simulation, we investigated the type 1 error and power of the logrank (LR) test and eight alternatives. We aimed to identify test(s) that improve power with three types of non-proportional hazards (non-PH): early, late or near-PH treatment effects.
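To make concrete how a power calculation leans on the PH assumption, here is a minimal sketch of the widely used Schoenfeld approximation for the total number of events needed by a two-sided logrank test with 1:1 allocation. The function name, the default significance level and power, and the example HR of 0.75 are illustrative assumptions and are not taken from the study.

```python
from math import ceil, log
from statistics import NormalDist


def schoenfeld_events(hr, alpha=0.05, power=0.9):
    """Approximate total events for a two-sided logrank test.

    Assumes proportional hazards with a constant hazard ratio `hr`
    and equal allocation to the two arms (Schoenfeld approximation).
    """
    z = NormalDist().inv_cdf  # standard normal quantile function
    return ceil(4 * (z(1 - alpha / 2) + z(power)) ** 2 / log(hr) ** 2)


# Hypothetical design target: HR = 0.75, 5% two-sided alpha, 90% power.
print(schoenfeld_events(0.75))
```

The formula uses a single constant HR, so when the true HR changes over follow-up the computed number of events no longer guarantees the nominal power.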
We investigated weighted logrank tests (early, LRE; late, LRL), the supremum logrank test (SupLR) and composite tests (joint, J; combined, C; weighted combined, WC; versatile and modified versatile weighted logrank, VWLR, VWLR2) with two or more components. Weighted logrank tests are intended to be sensitive to particular non-PH patterns. Composite tests attempt to improve power across a wider range of non-PH patterns. Using extensive simulations based on real trials, we studied test size and power under PH and under simple departures from PH comprising piecewise constant HRs with a single change point at various follow-up times. We systematically investigated the influence of high or low control-arm event rates on power.
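The simulation design described above can be sketched as follows: event times are drawn from a hazard that is constant before and after a single change point, and a weighted logrank statistic is computed on the result. This is an illustration only. The Fleming-Harrington weights S(t-)^rho (1 - S(t-))^gamma are one common weight family and are assumed here; they are not necessarily the weights used for LRE and LRL in the study. All function names, hazards, sample sizes and the follow-up time are hypothetical.

```python
import numpy as np


def simulate_arm(n, hazard_before, hazard_after, change_point, rng):
    """Draw n event times from a hazard with a single change point."""
    e = rng.exponential(size=n)  # unit-exponential cumulative hazards
    cum_at_change = hazard_before * change_point
    return np.where(e < cum_at_change,
                    e / hazard_before,
                    change_point + (e - cum_at_change) / hazard_after)


def weighted_logrank_z(time, event, group, rho=0.0, gamma=0.0):
    """Z statistic of a Fleming-Harrington(rho, gamma) weighted logrank test.

    rho = gamma = 0 gives the standard logrank test. Negative values
    mean fewer events than expected in group 1.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    group = np.asarray(group, dtype=int)
    s_prev = 1.0  # pooled Kaplan-Meier estimate just before t
    u = v = 0.0
    for t in np.unique(time[event]):
        at_risk = time >= t
        dying = event & (time == t)
        n, n1 = at_risk.sum(), (at_risk & (group == 1)).sum()
        d, d1 = dying.sum(), (dying & (group == 1)).sum()
        w = s_prev ** rho * (1 - s_prev) ** gamma
        u += w * (d1 - d * n1 / n)  # observed minus expected in group 1
        if n > 1:
            v += w ** 2 * d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
        s_prev *= 1 - d / n
    return u / np.sqrt(v)


# Hypothetical late effect: HR = 1 before year 1, HR = 0.5 afterwards.
rng = np.random.default_rng(2020)
n_per_arm, follow_up = 500, 5.0
control = simulate_arm(n_per_arm, 0.2, 0.2, 1.0, rng)
treated = simulate_arm(n_per_arm, 0.2, 0.2 * 0.5, 1.0, rng)
raw = np.concatenate([control, treated])
group = np.repeat([0, 1], n_per_arm)
event = raw <= follow_up          # administrative censoring at follow_up
time = np.minimum(raw, follow_up)

for label, rho, gamma in [("LR", 0, 0), ("early", 1, 0), ("late", 0, 1)]:
    print(label, round(weighted_logrank_z(time, event, group, rho, gamma), 2))
```

Repeating this over many simulated trials and counting how often |Z| exceeds 1.96 gives an empirical power (or, with HR = 1 throughout, an empirical type 1 error) for each weighting.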
With no preconceived type of treatment effect, the preferred test is VWLR2. Expecting an early effect, tests with acceptable power are SupLR, C, VWLR2, J, LRE and WC. Expecting a late effect, acceptable tests are LRL, VWLR, VWLR2, WC and J. Under near-PH, acceptable tests are LR, LRE, VWLR, C, VWLR2 and SupLR. Type 1 error was well controlled for all tests, showing only minor deviations from the nominal 5%. The location of the HR change point relative to the cumulative proportion of control-arm events considerably affected power.
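To illustrate why the position of the change point relative to the accumulation of control-arm events matters, the sketch below computes the expected share of control-arm events that occur before a change point. It assumes a constant (exponential) control-arm hazard and a common administrative follow-up time for all patients. These simplifications, the function name and the numerical values are illustrative assumptions, not details from the study.

```python
from math import exp


def share_of_events_before(change_point, hazard, follow_up):
    """Expected fraction of control-arm events occurring before change_point.

    Assumes an exponential control arm with the given hazard and that
    every patient is followed until follow_up or their event.
    """
    return (1 - exp(-hazard * change_point)) / (1 - exp(-hazard * follow_up))


# Same change point (1 year) and follow-up (5 years), different event rates.
for hazard in (0.05, 0.2, 1.0):
    print(hazard, round(share_of_events_before(1.0, hazard, 5.0), 3))
```

Under these assumptions, a higher control-arm event rate places a larger share of events before the same calendar change point, so an identical change point can behave like an early or a late departure from PH depending on the event rate.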
Assuming ignorance of the likely treatment effect, the best choice is VWLR2. Several non-standard tests performed well when the correct type of treatment effect was assumed. A low control-arm event rate reduced the power of weighted logrank tests targeting early effects. Test size was generally well controlled. Further investigation of test characteristics with different types of non-proportional hazards of the treatment effect is warranted.