Harrell F E, Lee K L, Califf R M, Pryor D B, Rosati R A
Stat Med. 1984 Apr-Jun;3(2):143-52. doi: 10.1002/sim.4780030207.
Regression models such as the Cox proportional hazards model are increasingly used to model and estimate the prognosis of patients with a variety of diseases. Many applications involve modelling a large number of variables using a relatively small patient sample. Problems of overfitting and of identifying important covariates are exacerbated in analysing prognosis because the accuracy of a model is more a function of the number of events than of the sample size. We used a general index of predictive discrimination to measure the ability of a model developed on training samples of varying sizes to predict survival in an independent test sample of patients suspected of having coronary artery disease. We compared three methods of model fitting: (1) standard 'step-up' variable selection, (2) incomplete principal components regression, and (3) Cox model regression after developing clinical indices from variable clusters. We found that regression using principal components offered superior predictions in the test sample, whereas regression using indices offered easily interpretable models nearly as good as the principal components models. Standard variable selection has a number of deficiencies.
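The abstract does not define its index of predictive discrimination. As an illustration only, the sketch below assumes a concordance-type index for right-censored survival data: among usable patient pairs, the fraction in which the patient who died earlier was assigned the higher predicted risk. The function name `concordance_index` and its arguments are hypothetical and are not taken from the paper.

```python
def concordance_index(times, events, risk_scores):
    """Concordance-type discrimination index for right-censored data.

    times       -- follow-up time for each patient
    events      -- 1 if the event (e.g. death) was observed, 0 if censored
    risk_scores -- model-predicted risk; higher means worse prognosis

    Assumption: a pair (i, j) is usable only when patient i had an observed
    event strictly before patient j's follow-up time. Pairs with tied times
    are skipped, and tied risk scores count as half-concordant.
    """
    if not (len(times) == len(events) == len(risk_scores)):
        raise ValueError("times, events and risk_scores must have equal length")

    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier failure
        for j in range(n):
            if i == j or not times[i] < times[j]:
                continue
            usable += 1
            if risk_scores[i] > risk_scores[j]:
                concordant += 1.0
            elif risk_scores[i] == risk_scores[j]:
                concordant += 0.5

    if usable == 0:
        raise ValueError("no usable pairs: cannot compute the index")
    return concordant / usable
```

Under these assumptions, a value of 0.5 corresponds to predictions no better than chance and 1.0 to perfect ranking of patients by risk. Evaluating the index on an independent test sample, as the study did, avoids the optimism that comes from scoring a model on the data used to fit it.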