Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR 72079, USA.
BMC Med Res Methodol. 2012 Jul 23;12:102. doi: 10.1186/1471-2288-12-102.
Cancer survival studies are commonly analyzed using survival-time prediction models for cancer prognosis. A number of different performance metrics are used to ascertain the concordance between the predicted risk score of each patient and the actual survival time, but these metrics can sometimes conflict. Alternatively, patients are sometimes divided into two classes according to a survival-time threshold, and binary classifiers are applied to predict each patient's class. Although this approach has several drawbacks, it does provide natural performance metrics such as positive and negative predictive values to enable unambiguous assessments.
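The positive and negative predictive values mentioned above are simple functions of a classifier's confusion matrix. As a minimal sketch (with hypothetical labels and predictions, not data from the study), dichotomizing patients at a survival-time threshold and scoring a binary classifier might look like:

```python
import numpy as np

# Hypothetical example: patients dichotomized at a survival-time
# threshold (1 = survived past the threshold, 0 = did not), together
# with a binary classifier's predicted classes.
y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0, 1, 1])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0, 0, 1])

# Confusion-matrix counts.
tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
tn = np.sum((y_pred == 0) & (y_true == 0))  # true negatives
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives

ppv = tp / (tp + fp)  # positive predictive value
npv = tn / (tn + fn)  # negative predictive value
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")  # PPV = 0.80, NPV = 0.60
```

Because both quantities are simple proportions, they yield the unambiguous assessments referred to in the abstract, at the cost of discarding the continuous survival information.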
We compare the survival-time prediction and survival-time threshold approaches to analyzing cancer survival studies. We review and compare common performance metrics for the two approaches. We present new randomization tests and cross-validation methods to enable unambiguous statistical inferences for several performance metrics used with the survival-time prediction approach. We consider five survival prediction models consisting of one clinical model, two gene expression models, and two models from combinations of clinical and gene expression models.
A public breast cancer dataset was used to compare several performance metrics using five prediction models. 1) For some prediction models, the hazard ratio from fitting a Cox proportional hazards model was significant, but the two-group comparison was insignificant, and vice versa. 2) The randomization test and cross-validation were generally consistent with the p-values obtained from the standard performance metrics. 3) The performance of binary classifiers depended strongly on how the risk groups were defined; a slight change of the survival-time threshold for assigning classes led to very different prediction results.
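The threshold sensitivity in finding 3) can be probed by sweeping all candidate risk-score cut-offs and recording a two-group p-value at each. The sketch below uses simulated data (not the breast cancer dataset) and, to stay self-contained, a Mann-Whitney comparison of uncensored survival times as a stand-in for a proper log-rank test:

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical data: risk scores and (uncensored) survival times,
# simulated for illustration only.
rng = np.random.default_rng(0)
n = 120
risk = rng.normal(size=n)
# Survival time loosely anti-correlated with risk, plus noise.
surv_time = np.exp(1.0 - 0.5 * risk + rng.normal(scale=0.8, size=n))

# Sweep candidate thresholds on the risk score; at each one, compare
# survival times of the resulting high- vs low-risk groups.
thresholds = np.quantile(risk, np.linspace(0.2, 0.8, 25))
pvals = []
for t in thresholds:
    hi, lo = surv_time[risk > t], surv_time[risk <= t]
    pvals.append(mannwhitneyu(hi, lo, alternative="two-sided").pvalue)
pvals = np.array(pvals)

# The spread of p-values across thresholds shows how sensitive the
# "significant vs not" verdict is to the chosen cut-off.
print(f"min p = {pvals.min():.3g}, max p = {pvals.max():.3g}")
```

Plotting `pvals` against `thresholds` gives the kind of sensitivity display the conclusions recommend.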
1) Different performance metrics for evaluating survival prediction models can lead to different conclusions about their discriminatory ability. 2) Evaluation based on comparing high-risk versus low-risk groups depends on the chosen risk-score threshold; plotting p-values over all possible thresholds can reveal the sensitivity of the threshold selection. 3) A randomization test of Somers' rank correlation can be used to further assess the performance of a prediction model. 4) The cross-validated power of survival prediction models decreases as the imbalance between training and test sets increases.
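The conclusions point to a randomization test of Somers' rank correlation. A minimal sketch of the idea, on simulated uncensored data (censoring, which the paper's methods handle, is ignored here): permute the risk scores to break any association with survival time, and compare the permuted statistics with the observed one.

```python
import numpy as np
from scipy.stats import somersd

# Hypothetical data for illustration: predicted risk scores and
# observed (uncensored) survival times; higher risk -> shorter survival.
rng = np.random.default_rng(1)
n = 80
risk = rng.normal(size=n)
surv_time = -risk + rng.normal(scale=1.0, size=n)

# Observed Somers' D between risk score and survival time.
d_obs = somersd(risk, surv_time).statistic

# Randomization test: permuting the risk scores destroys any true
# association; count how often the permuted |D| reaches the observed |D|.
n_perm = 500
count = 0
for _ in range(n_perm):
    d_perm = somersd(rng.permutation(risk), surv_time).statistic
    if abs(d_perm) >= abs(d_obs):
        count += 1
p_value = (count + 1) / (n_perm + 1)  # add-one-smoothed permutation p-value
print(f"Somers' D = {d_obs:.3f}, randomization p = {p_value:.3f}")
```

With a strongly predictive score, `d_obs` is far from zero and the randomization p-value is small; with an uninformative score, the permuted statistics bracket the observed one and the p-value is large.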