Steyerberg E W, Eijkemans M J, Harrell F E, Habbema J D
Center for Clinical Decision Sciences, Department of Public Health, Erasmus University, Rotterdam, The Netherlands.
Stat Med. 2000 Apr 30;19(8):1059-79. doi: 10.1002/(sici)1097-0258(20000430)19:8<1059::aid-sim412>3.0.co;2-0.
Logistic regression analysis may well be used to develop a prognostic model for a dichotomous outcome. Especially when limited data are available, it is difficult to determine an appropriate selection of covariables for inclusion in such models. Also, predictions may be improved by applying some sort of shrinkage in the estimation of regression coefficients. In this study we compare the performance of several selection and shrinkage methods in small data sets of patients with acute myocardial infarction, where we aim to predict 30-day mortality. Selection methods included backward stepwise selection with significance levels alpha of 0.01, 0.05, 0.157 (the AIC criterion) or 0.50, and the use of qualitative external information on the sign of regression coefficients in the model. Estimation methods included standard maximum likelihood, the use of a linear shrinkage factor, penalized maximum likelihood, the Lasso, or quantitative external information on univariable regression coefficients. We found that stepwise selection with a low alpha (for example, 0.05) led to a relatively poor model performance, when evaluated on independent data. Substantially better performance was obtained with full models with a limited number of important predictors, where regression coefficients were reduced with any of the shrinkage methods. Incorporation of external information for selection and estimation improved the stability and quality of the prognostic models. We therefore recommend shrinkage methods in full models including prespecified predictors and incorporation of external information, when prognostic models are constructed in small data sets.
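Two quantities named in the abstract can be made concrete with a short sketch. The first part shows why a significance level of 0.157 corresponds to the AIC criterion: for a single coefficient, AIC retains a predictor when its likelihood ratio chi-square exceeds 2, and the upper tail probability of a chi-square distribution with 1 degree of freedom at 2 is about 0.157. The second part illustrates a linear shrinkage factor using the common heuristic (model chi-square minus degrees of freedom) divided by model chi-square. That formula is an assumption chosen for illustration; the abstract does not state how the shrinkage factor was estimated. The function names and the numeric inputs are hypothetical.

```python
import math


def chi2_1df_upper_tail(x: float) -> float:
    """Upper tail probability P(X > x) for a chi-square variable with 1 df.

    Uses the identity P(X > x) = erfc(sqrt(x / 2)), valid for 1 df only.
    """
    if x < 0:
        raise ValueError("x must be non-negative")
    return math.erfc(math.sqrt(x / 2.0))


def heuristic_shrinkage(model_chi2: float, df: int) -> float:
    """Heuristic linear shrinkage factor: (model chi2 - df) / model chi2.

    Assumption: this is one common heuristic, not necessarily the
    estimator used in the study summarized above.
    """
    if model_chi2 <= 0:
        raise ValueError("model_chi2 must be positive")
    return max(0.0, (model_chi2 - df) / model_chi2)


def shrink_coefficients(coefs, factor):
    """Multiply each regression coefficient (intercept excluded) by the factor."""
    return [factor * b for b in coefs]


# Significance level implied by AIC for one coefficient (chi-square cutoff 2).
aic_alpha = chi2_1df_upper_tail(2.0)
print(round(aic_alpha, 3))

# Hypothetical model: likelihood ratio chi-square of 40 on 8 predictors.
factor = heuristic_shrinkage(40.0, 8)
shrunk = shrink_coefficients([0.5, -1.2], factor)
print(factor, shrunk)
```

After the slope coefficients are shrunk, the intercept would normally be re-estimated so that the average predicted probability still matches the observed event rate; that step is omitted here for brevity.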