Peduzzi P, Concato J, Kemper E, Holford T R, Feinstein A R
Cooperative Studies Program Coordinating Center, Veterans Affairs Medical Center, West Haven, Connecticut 06516, USA.
J Clin Epidemiol. 1996 Dec;49(12):1373-9. doi: 10.1016/s0895-4356(96)00236-3.
We performed a Monte Carlo study to evaluate the effect of the number of events per variable (EPV) analyzed in logistic regression analysis. The simulations were based on data from a cardiac trial of 673 patients in which 252 deaths occurred and seven variables were cogent predictors of mortality; the number of events per predictive variable was (252/7 =) 36 for the full sample. For the simulations, at values of EPV = 2, 5, 10, 15, 20, and 25, we randomly generated 500 samples of the 673 patients, chosen with replacement, according to a logistic model derived from the full sample. Simulation results for the regression coefficients for each variable in each group of 500 samples were compared for bias, precision, and significance testing against the results of the model fitted to the original sample. For EPV values of 10 or greater, no major problems occurred. For EPV values less than 10, however, the regression coefficients were biased in both positive and negative directions; the large sample variance estimates from the logistic model both overestimated and underestimated the sample variance of the regression coefficients; the 90% confidence limits about the estimated values did not have proper coverage; the Wald statistic was conservative under the null hypothesis; and paradoxical associations (significance in the wrong direction) were increased. Although other factors (such as the total number of events, or sample size) may influence the validity of the logistic model, our findings indicate that low EPV can lead to major problems.
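As a rough illustration of the EPV arithmetic used in the abstract, the sketch below computes EPV for the full sample (252 deaths, seven predictors) and the smallest number of events that would reach each simulated EPV level with seven predictors. The function names `events_per_variable` and `events_needed` are hypothetical helpers introduced here, not part of the study, and the sketch does not reproduce the authors' simulation procedure.

```python
import math


def events_per_variable(n_events: int, n_predictors: int) -> float:
    """EPV: number of outcome events divided by number of predictors."""
    if n_predictors <= 0:
        raise ValueError("n_predictors must be positive")
    return n_events / n_predictors


def events_needed(target_epv: float, n_predictors: int) -> int:
    """Smallest whole number of events that reaches the target EPV."""
    return math.ceil(target_epv * n_predictors)


# Figures taken from the abstract: 252 deaths and 7 predictors.
full_sample_epv = events_per_variable(252, 7)
print(f"Full-sample EPV: {full_sample_epv:.0f}")

# EPV levels examined in the simulations, with 7 predictors assumed throughout.
for epv in (2, 5, 10, 15, 20, 25):
    print(f"EPV = {epv:>2}: at least {events_needed(epv, 7)} events")
```

Under the threshold reported above, a seven-variable model would need roughly 70 or more events before the problems described for EPV below 10 become less of a concern.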