Ahmed Ali Usama, Ten Hove Joren R, Reiber Beata M, van der Sluis Pieter C, Besselink Marc G
Department of Surgery, University Medical Center Utrecht, Utrecht, The Netherlands.
J Surg Res. 2018 Aug;228:1-7. doi: 10.1016/j.jss.2018.02.014. Epub 2018 Mar 21.
Interpretation of randomized controlled trials (RCTs) without a significant difference regarding the primary outcome (negative RCTs) is frequently challenging, due to concerns about sample size and thus sufficient statistical power. We aimed to assess the adequacy of sample size and corresponding power of surgical RCTs.
We previously identified all surgical RCTs available in PubMed in two distinct years a decade apart (1999 and 2009). For all "negative" trials, we estimated whether the sample size of the trial was appropriate to detect a difference in the primary outcome measure. The main outcome measure was a sufficient sample size to detect large, medium, and small treatment effects. We also performed a post hoc power analysis based on the actual observed effect difference.
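To illustrate what checking a trial's sample size against large, medium, and small treatment effects can involve, here is a minimal sketch. The abstract does not state the thresholds or the test used, so the sketch assumes Cohen's conventional standardized effect sizes (0.8, 0.5, 0.2), a two-sided alpha of 0.05, a power target of 80%, equal allocation to two arms, and a normal approximation. The names `EFFECT_SIZES`, `approx_power`, and `adequately_powered` are hypothetical.

```python
from statistics import NormalDist

# Assumption: Cohen's conventional standardized effect sizes.
# The study's own definitions of large/medium/small are not given in the abstract.
EFFECT_SIZES = {"large": 0.8, "medium": 0.5, "small": 0.2}


def approx_power(total_n, d, alpha=0.05):
    """Approximate power of a two-sided, two-arm comparison of means.

    Uses the normal approximation with equal allocation; total_n is the
    combined sample size of both arms and d is the standardized difference.
    """
    z = NormalDist()
    n_per_arm = total_n / 2
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected test statistic under the alternative hypothesis.
    shift = d * (n_per_arm / 2) ** 0.5
    # Ignores the tiny chance of rejecting in the wrong direction.
    return z.cdf(shift - z_crit)


def adequately_powered(total_n, d, target=0.80, alpha=0.05):
    """True if the approximate power reaches the assumed target."""
    return approx_power(total_n, d, alpha) >= target


if __name__ == "__main__":
    # 76 is the median sample size reported for 1999 in the abstract.
    for label, d in EFFECT_SIZES.items():
        power = approx_power(76, d)
        print(f"{label}: power={power:.2f}, adequate={adequately_powered(76, d)}")
```

Under these assumptions, a trial of the median size is adequately powered only for a large effect. The post hoc analysis mentioned above would use the observed effect difference in place of the conventional values of `d`.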
A total of 228 negative RCTs (74 in 1999 and 121 in 2009) were included. The median sample size was 76 (± 222) in 1999 and 80 (± 163) in 2009. Reporting of a sample size calculation increased from 40% in 1999 to 54% in 2009 (P = 0.02). The proportion of studies adequately powered to detect large (57% versus 68%), medium (26% versus 25%), or small (8% versus 7%) differences did not differ significantly between 1999 and 2009. To reach sufficient power, the required increases in sample size were 130%, 240%, and 1032% for large, medium, and small differences, respectively. Reporting a sample size calculation was the only independent predictor of adequate power.
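A required increase in sample size of this kind can be sketched as follows. This is an illustration under stated assumptions, not the study's procedure: it uses the standard normal-approximation formula for a two-arm comparison of means, a two-sided alpha of 0.05, 80% power, equal allocation, and Cohen's conventional effect sizes. The function names `required_per_arm` and `percent_increase` are hypothetical.

```python
import math
from statistics import NormalDist


def required_per_arm(d, alpha=0.05, power=0.80):
    """Participants needed per arm to detect standardized difference d.

    Normal approximation: n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2,
    rounded up to a whole participant.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


def percent_increase(actual_total, d, alpha=0.05, power=0.80):
    """Percentage by which a trial's total sample size would have to grow.

    Returns 0.0 when the trial already meets the requirement.
    """
    required_total = 2 * required_per_arm(d, alpha, power)
    return max(0.0, (required_total - actual_total) / actual_total * 100)


if __name__ == "__main__":
    # Assumed effect sizes; 80 is the median sample size reported for 2009.
    for label, d in {"large": 0.8, "medium": 0.5, "small": 0.2}.items():
        print(label, required_per_arm(d), round(percent_increase(80, d), 1))
```

Because the required sample size scales with 1/d², halving the effect size roughly quadruples the number of participants needed, which is consistent with the steep rise in required increases from large to small differences.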
Despite a slight improvement in the reporting of sample size calculations, about a third of surgical trials remain underpowered to demonstrate differences that are likely to be clinically significant. Increased attention from researchers, medical ethics boards, and journal editors is required to reduce the resources potentially wasted on underpowered trials.