Martin Schumacher, Harald Binder, Thomas Gerds
Department of Medical Biometry and Statistics, Institute of Medical Biometry and Medical Informatics, University Medical Center Freiburg, Germany.
Bioinformatics. 2007 Jul 15;23(14):1768-74. doi: 10.1093/bioinformatics/btm232. Epub 2007 May 7.
Developing a risk prediction model involves several steps of model building and model selection. If this process is not adequately controlled, overfitting can make performance estimates seriously overoptimistic and lead to erroneous conclusions.
For right censored time-to-event data, we estimate the prediction error for assessing the performance of a risk prediction model (Gerds and Schumacher, 2006; Graf et al., 1999). Furthermore, resampling methods are used to detect overfitting and resulting overoptimism and to adjust the estimates of prediction error (Gerds and Schumacher, 2007).
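As an illustration of what a prediction error estimate for right-censored data can look like, the following is a minimal sketch of an inverse-probability-of-censoring-weighted Brier score at a single time point, in the spirit of Graf et al. (1999). It is not the pec implementation. The function names `censoring_km` and `ipcw_brier` are hypothetical, and the sketch assumes that censoring is independent of the covariates and that the caller supplies the predicted survival probabilities at time `t`.

```python
import numpy as np

def censoring_km(times, events):
    """Kaplan-Meier estimate G of the censoring survival function.

    events[i] is 1 for an observed event and 0 for a censored time.
    Returns a function G(t, left_limit=False) giving G(t) or G(t-).
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    uniq = np.unique(times)
    # Subjects still under observation just before each distinct time.
    at_risk = np.array([(times >= u).sum() for u in uniq])
    # Censorings are the "events" of the censoring distribution.
    n_cens = np.array([((times == u) & (events == 0)).sum() for u in uniq])
    surv = np.cumprod(1.0 - n_cens / at_risk)

    def G(t, left_limit=False):
        side = "left" if left_limit else "right"
        k = np.searchsorted(uniq, t, side=side)
        return 1.0 if k == 0 else float(surv[k - 1])

    return G

def ipcw_brier(times, events, surv_pred, t):
    """Weighted Brier score at time t.

    surv_pred[i] is the model's predicted probability that subject i
    survives beyond t.
    """
    G = censoring_km(times, events)
    total = 0.0
    for T, d, s in zip(times, events, surv_pred):
        if T <= t and d == 1:
            # Event observed by time t: true status is 0.
            total += (0.0 - s) ** 2 / G(T, left_limit=True)
        elif T > t:
            # Still event-free at t: true status is 1.
            total += (1.0 - s) ** 2 / G(t)
        # Subjects censored before t get weight 0.
    return total / len(times)

# Toy data (made up for illustration): one subject is censored at time 2.
score = ipcw_brier([1, 2, 3, 4], [1, 0, 1, 1], [0.5, 0.5, 0.5, 0.5], t=2.5)
```

Evaluating `ipcw_brier` over a grid of time points gives a prediction error curve. Without censoring all weights equal 1 and the score reduces to the ordinary Brier score.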
We show how and to what extent the methodology can be used in situations characterized by a large number of potential predictor variables where overfitting may be expected to be overwhelming. This is illustrated by estimating the prediction error of some recently proposed techniques for fitting a multivariate Cox regression model applied to the data of a prognostic study in patients with diffuse large-B-cell lymphoma (DLBCL).
Resampling-based estimation of prediction error curves is implemented in an R package called pec available from the authors.
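The resampling idea can be sketched independently of the pec package. The example below is a simplified illustration, assuming the plain .632 bootstrap combination rather than the .632+ variant, and averaging the out-of-bag error per bootstrap sample. The function `bootstrap_632`, the callables `fit` and `error`, and the toy 1-nearest-neighbour model are all hypothetical names introduced for this sketch. The toy uses uncensored data with squared error for brevity; for time-to-event data the `error` callable would be replaced by a censoring-adjusted prediction error.

```python
import numpy as np

def bootstrap_632(n, fit, error, n_boot=100, seed=0):
    """Apparent, out-of-bag and .632 estimates of prediction error.

    fit(idx)          -> model trained on the rows indexed by idx
    error(model, idx) -> mean prediction error of model on rows idx
    """
    rng = np.random.default_rng(seed)
    all_idx = np.arange(n)
    # Apparent error: train and evaluate on the same data (overoptimistic).
    apparent = error(fit(all_idx), all_idx)
    oob_errors = []
    for _ in range(n_boot):
        train = rng.integers(0, n, size=n)      # sample with replacement
        test = np.setdiff1d(all_idx, train)     # rows not drawn (out-of-bag)
        if test.size == 0:
            continue
        oob_errors.append(error(fit(train), test))
    oob = float(np.mean(oob_errors))
    return {
        "apparent": apparent,
        "oob": oob,
        "err632": 0.368 * apparent + 0.632 * oob,
    }

# Toy example: pure-noise outcome, so any apparent fit is overfitting.
data_rng = np.random.default_rng(1)
x = data_rng.normal(size=40)
y = data_rng.normal(size=40)

def fit(idx):
    # A 1-nearest-neighbour "model" simply memorizes its training rows.
    return idx

def error(train_idx, test_idx):
    dist = np.abs(x[test_idx][:, None] - x[train_idx][None, :])
    nearest = train_idx[np.argmin(dist, axis=1)]
    return float(np.mean((y[test_idx] - y[nearest]) ** 2))

res = bootstrap_632(len(x), fit, error)
```

Here the memorizing model reproduces its training data exactly, so its apparent error is zero, while the out-of-bag error is clearly positive. The gap between the two is the overoptimism that resampling is meant to expose.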