Ogundimu Emmanuel O, Altman Douglas G, Collins Gary S
Centre for Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology & Musculoskeletal Diseases, Botnar Research Centre, University of Oxford, Windmill Road, Oxford OX3 7LD, UK.
J Clin Epidemiol. 2016 Aug;76:175-82. doi: 10.1016/j.jclinepi.2016.02.031. Epub 2016 Mar 8.
The choice of an adequate sample size for a Cox regression analysis is generally based on a rule of thumb, derived from simulation studies, of a minimum of 10 events per variable (EPV). One simulation study suggested scenarios in which the 10 EPV rule can be relaxed. The effect of a range of binary predictors with varying prevalence, reflecting clinical practice, has not yet been fully investigated.
We conducted an extended resampling study using a large general-practice data set, comprising over 2 million anonymized patient records, to examine the EPV requirements for prediction models with low-prevalence binary predictors developed using Cox regression. The performance of the models was then evaluated using an independent external validation data set. We investigated both fully specified models and models derived using variable selection.
Our results indicated that an EPV rule of thumb should be data driven and that EPV ≥ 20 generally eliminates bias in regression coefficients when many low-prevalence predictors are included in a Cox model.
Higher EPV is needed when low-prevalence predictors are present in a model to eliminate bias in regression coefficients and improve predictive accuracy.
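As a rough illustration of the arithmetic behind an EPV threshold, the sketch below assumes the usual definition of EPV as the number of outcome events divided by the number of candidate predictor parameters; the abstract itself does not spell out the definition. The function names and the example numbers are hypothetical and are not taken from the study.

```python
import math


def events_per_variable(n_events: int, n_parameters: int) -> float:
    """Return EPV, assumed here to be events / candidate predictor parameters."""
    if n_parameters <= 0:
        raise ValueError("n_parameters must be positive")
    return n_events / n_parameters


def min_events_required(n_parameters: int, target_epv: float = 20.0) -> int:
    """Return the smallest number of events that reaches the target EPV.

    The default of 20 mirrors the EPV >= 20 figure reported above for models
    with many low-prevalence predictors; it is not a universal threshold.
    """
    if n_parameters <= 0:
        raise ValueError("n_parameters must be positive")
    return math.ceil(n_parameters * target_epv)


# Hypothetical example: 150 events and 10 candidate parameters.
epv = events_per_variable(150, 10)
needed = min_events_required(10)
print(f"EPV = {epv:.1f}; events needed for EPV >= 20: {needed}")
```

In this hypothetical case the data would satisfy the conventional 10 EPV rule but fall short of the EPV ≥ 20 level, which is the kind of gap the results above suggest matters when low-prevalence predictors are present.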