Biometric Research Branch, US National Cancer Institute, Bethesda, MD 20892-7434, USA.
Brief Bioinform. 2011 May;12(3):203-14. doi: 10.1093/bib/bbr001. Epub 2011 Feb 15.
Developments in whole genome biotechnology have stimulated statistical focus on prediction methods. We review here methodology for classifying patients into survival risk groups and for using cross-validation to evaluate such classifications. Measures of discrimination for survival risk models include separation of survival curves, time-dependent ROC curves and Harrell's concordance index. For high-dimensional data applications, however, computing these measures as re-substitution statistics on the same data used for model development results in highly biased estimates. Most developments in methodology for survival risk modeling with high-dimensional data have utilized separate test data sets for model evaluation. Cross-validation has sometimes been used for optimization of tuning parameters. In many applications, however, the data available are too limited for effective division into training and test sets and consequently authors have often either reported re-substitution statistics or analyzed their data using binary classification methods in order to utilize familiar cross-validation. In this article we have tried to indicate how to utilize cross-validation for the evaluation of survival risk models; specifically how to compute cross-validated estimates of survival distributions for predicted risk groups and how to compute cross-validated time-dependent ROC curves. We have also discussed evaluation of the statistical significance of a survival risk model and evaluation of whether high-dimensional genomic data adds predictive accuracy to a model based on standard covariates alone.
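The cross-validated survival curves for predicted risk groups described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The names `harrell_c`, `kaplan_meier`, `cv_risk_groups` and `toy_fit` are hypothetical, `toy_fit` is a deliberately simple stand-in for a real survival model, and the high/low cutoff at the training-set median is an assumption. What the sketch tries to show is that every model-building step, including feature selection and the choice of cutoff, is repeated inside each fold, so that each patient is assigned to a risk group by a model that never saw that patient.

```python
import math
import random


def harrell_c(times, events, risk):
    """Harrell's concordance index: among usable pairs, the share in which
    the subject with the shorter observed survival has the higher risk."""
    concordant, usable = 0.0, 0
    for i in range(len(times)):
        for j in range(len(times)):
            # Usable pair: subject i had an observed event before time j.
            if events[i] and times[i] < times[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied scores count as half
    return concordant / usable if usable else float("nan")


def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (event time, survival probability)."""
    surv, curve = 1.0, []
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if e and u == t)
        surv *= 1.0 - deaths / at_risk
        curve.append((t, surv))
    return curve


def toy_fit(X, times, events):
    """Hypothetical stand-in for a survival model: pick the single feature
    most concordant (or discordant) with risk, using training data only."""
    best_j, best_c = 0, 0.5
    for j in range(len(X[0])):
        c = harrell_c(times, events, [row[j] for row in X])
        if abs(c - 0.5) > abs(best_c - 0.5):
            best_j, best_c = j, c
    sign = 1.0 if best_c >= 0.5 else -1.0
    return lambda row: sign * row[best_j]


def cv_risk_groups(X, times, events, fit, k=5, seed=0):
    """Label each subject 'high' or 'low' with a model fitted without it."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    groups = [""] * len(X)
    for f in range(k):
        test = idx[f::k]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        # Refit from scratch on the training folds (selection included).
        scorer = fit([X[i] for i in train],
                     [times[i] for i in train],
                     [events[i] for i in train])
        train_scores = sorted(scorer(X[i]) for i in train)
        cutoff = train_scores[len(train_scores) // 2]  # assumed: training median
        for i in test:
            groups[i] = "high" if scorer(X[i]) > cutoff else "low"
    return groups


if __name__ == "__main__":
    # Simulated data only: feature 0 raises the hazard, 19 features are noise.
    rng = random.Random(1)
    X = [[rng.gauss(0, 1) for _ in range(20)] for _ in range(60)]
    latent = [rng.expovariate(math.exp(row[0])) for row in X]
    censor = [rng.expovariate(0.3) for _ in X]
    times = [min(t, c) for t, c in zip(latent, censor)]
    events = [t <= c for t, c in zip(latent, censor)]
    groups = cv_risk_groups(X, times, events, toy_fit)
    for g in ("low", "high"):
        members = [i for i, lab in enumerate(groups) if lab == g]
        km = kaplan_meier([times[i] for i in members], [events[i] for i in members])
        print(g, len(members), km[:3])
```

Pooling the held-out labels across folds and then drawing one Kaplan-Meier curve per group gives the cross-validated curves. Fitting `toy_fit` once on all patients and plotting the resulting groups would instead give the re-substitution curves, which the abstract warns are highly biased for high-dimensional data.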