Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, 1275 York Avenue, Box 44, New York, NY 10065 USA.
BMC Med Res Methodol. 2011 Jan 28;11:13. doi: 10.1186/1471-2288-11-13.
We have observed that the area under the receiver operating characteristic curve (AUC) is increasingly being used to evaluate whether a novel predictor should be incorporated in a multivariable model to predict risk of disease. Frequently, investigators will approach the issue in two distinct stages: first, by testing whether the new predictor variable is significant in a multivariable regression model; second, by testing differences between the AUC of models with and without the predictor using the same data from which the predictive models were derived. These two steps often lead to discordant conclusions.
We conducted a simulation study in which two predictors, X and X*, were generated as standard normal variables with varying levels of predictive strength, represented by means that differed depending on the binary outcome Y. The data sets were analyzed using logistic regression, and likelihood ratio and Wald tests for the incremental contribution of X* were performed. The patient-specific predictors for each of the models were then used as data for a test comparing the two AUCs. Under the null, the sizes of the likelihood ratio and Wald tests were close to nominal, but the area test was extremely conservative, with test sizes less than 0.006 for all configurations studied. Where X* was associated with outcome, the area test had much lower power than the likelihood ratio and Wald tests.
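The simulation design described above can be sketched in a few dozen lines of standard-library Python. This is an illustrative sketch under stated assumptions, not the authors' code: the abstract does not name the area test, so a DeLong-type nonparametric comparison of correlated AUCs is assumed; X and X* are assumed independent given Y; outcome prevalence is assumed to be 0.5; and the Wald test is omitted for brevity. All function names (`simulate`, `fit_logistic`, `delong_pvalue`, `rejection_rates`) are hypothetical.

```python
import math
import random

def simulate(n, mu_x, mu_xstar, rng):
    """Y ~ Bernoulli(0.5); X ~ N(mu_x*Y, 1), X* ~ N(mu_xstar*Y, 1), independent given Y (assumption)."""
    ys = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
    xs = [rng.gauss(mu_x * y, 1.0) for y in ys]
    xstars = [rng.gauss(mu_xstar * y, 1.0) for y in ys]
    return ys, xs, xstars

def solve(a, b):
    """Solve the small linear system a z = b by Gaussian elimination with partial pivoting."""
    p = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, p):
            f = m[r][c] / m[c][c]
            for k in range(c, p + 1):
                m[r][k] -= f * m[c][k]
    z = [0.0] * p
    for r in reversed(range(p)):
        z[r] = (m[r][p] - sum(m[r][k] * z[k] for k in range(r + 1, p))) / m[r][r]
    return z

def fit_logistic(rows, ys, max_iter=50, tol=1e-8):
    """Newton-Raphson logistic regression; returns (coefficients, log-likelihood, linear scores)."""
    p = len(rows[0])
    beta = [0.0] * p
    for _ in range(max_iter):
        grad = [0.0] * p
        info = [[0.0] * p for _ in range(p)]  # observed information X'WX
        for row, y in zip(rows, ys):
            pr = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, row))))
            for i in range(p):
                grad[i] += (y - pr) * row[i]
                for j in range(p):
                    info[i][j] += pr * (1.0 - pr) * row[i] * row[j]
        step = solve(info, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    scores = [sum(b * v for b, v in zip(beta, row)) for row in rows]
    loglik = sum(y * s - math.log1p(math.exp(s)) for y, s in zip(ys, scores))
    return beta, loglik, scores

def delong_pvalue(ys, s_full, s_reduced):
    """Two-sided DeLong-type test that two correlated AUCs are equal (assumed area test)."""
    cases = [i for i, y in enumerate(ys) if y == 1]
    ctrls = [i for i, y in enumerate(ys) if y == 0]
    def psi(a, b):
        return 1.0 if a > b else 0.5 if a == b else 0.0
    def dpsi(i, j):  # difference between models in the case-vs-control comparison
        return psi(s_full[i], s_full[j]) - psi(s_reduced[i], s_reduced[j])
    d_case = [sum(dpsi(i, j) for j in ctrls) / len(ctrls) for i in cases]
    d_ctrl = [sum(dpsi(i, j) for i in cases) / len(cases) for j in ctrls]
    def var(v):
        mean = sum(v) / len(v)
        return sum((x - mean) ** 2 for x in v) / (len(v) - 1)
    auc_diff = sum(d_case) / len(d_case)
    se = math.sqrt(var(d_case) / len(d_case) + var(d_ctrl) / len(d_ctrl))
    if se == 0.0:
        return 1.0  # identical rankings: no evidence of a difference
    return math.erfc(abs(auc_diff / se) / math.sqrt(2.0))

def rejection_rates(n, mu_x, mu_xstar, reps, alpha=0.05, seed=1):
    """Fraction of simulated data sets in which each test rejects at level alpha."""
    rng = random.Random(seed)
    lr_hits = area_hits = 0
    for _ in range(reps):
        ys, xs, xstars = simulate(n, mu_x, mu_xstar, rng)
        _, ll0, s0 = fit_logistic([[1.0, x] for x in xs], ys)
        _, ll1, s1 = fit_logistic([[1.0, x, w] for x, w in zip(xs, xstars)], ys)
        lr_stat = max(0.0, 2.0 * (ll1 - ll0))
        # Chi-square survival function with 1 df: P(chi2 > t) = erfc(sqrt(t / 2)).
        lr_hits += math.erfc(math.sqrt(lr_stat / 2.0)) < alpha
        area_hits += delong_pvalue(ys, s1, s0) < alpha
    return lr_hits / reps, area_hits / reps
```

Calling `rejection_rates(n, mu_x, 0.0, reps)` estimates the size of each test, because X* then carries no information about Y; a nonzero `mu_xstar` estimates power. The sample sizes, effect sizes and replicate counts are left to the caller, since the abstract does not report the configurations studied.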
Evaluation of the statistical significance of a new predictor when there are existing clinical predictors is most appropriately accomplished in the context of a regression model. Although comparison of AUCs is conceptually equivalent to the likelihood ratio and Wald tests, it has vastly inferior statistical properties. Use of both approaches will frequently lead to inconsistent conclusions. Nonetheless, comparison of receiver operating characteristic curves remains a useful descriptive tool for initial evaluation of whether a new predictor might be of clinical relevance.