Peek N, Arts D G T, Bosman R J, van der Voort P H J, de Keizer N F
Department of Medical Informatics, Academic Medical Center, Universiteit van Amsterdam, Amsterdam, the Netherlands.
J Clin Epidemiol. 2007 May;60(5):491-501. doi: 10.1016/j.jclinepi.2006.08.011. Epub 2007 Feb 5.
To investigate the behavior of predictive performance measures that are commonly used in external validation of prognostic models for outcome at intensive care units (ICUs).
Four prognostic models (the Simplified Acute Physiology Score II, the Acute Physiology and Chronic Health Evaluation II, and the Mortality Probability Models II) were evaluated in the Dutch National Intensive Care Evaluation registry database. For each model, discrimination (area under the receiver operating characteristic curve, AUC), accuracy (Brier score), and two calibration measures were assessed on data from 41,239 ICU admissions. This validation procedure was repeated with smaller subsamples randomly drawn from the database, and the results were compared with those obtained on the entire data set.
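The subsampling procedure described above can be illustrated with a minimal sketch. It uses only the Python standard library and synthetic data, not the registry data; the helper names (`auc`, `brier`, `simulate`, `auc_range`), the Beta(1, 4) risk distribution, and the sample sizes are all assumptions made for illustration and are not taken from the study.

```python
import random


def auc(y_true, y_prob):
    """AUC via the rank-sum (Mann-Whitney) formula, using midranks for ties."""
    pairs = sorted(zip(y_prob, y_true))
    n = len(pairs)
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += midrank * sum(y for _, y in pairs[i:j])
        i = j
    n_pos = sum(y_true)
    n_neg = n - n_pos
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def brier(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def simulate(n, rng):
    """Synthetic, perfectly calibrated predictions (hypothetical data)."""
    probs = [rng.betavariate(1, 4) for _ in range(n)]  # skewed toward low risk
    outcomes = [1 if rng.random() < p else 0 for p in probs]
    return outcomes, probs


def auc_range(y, p, size, repeats, rng):
    """Lowest and highest AUC over repeated random subsamples of a given size."""
    values = []
    for _ in range(repeats):
        idx = rng.sample(range(len(y)), size)
        ys = [y[i] for i in idx]
        if 0 < sum(ys) < size:  # AUC needs both outcome classes
            values.append(auc(ys, [p[i] for i in idx]))
    return min(values), max(values)


rng = random.Random(0)
y_all, p_all = simulate(5000, rng)
print("full data: AUC=%.3f Brier=%.3f" % (auc(y_all, p_all), brier(y_all, p_all)))
for size in (200, 4000):
    lo, hi = auc_range(y_all, p_all, size, 50, rng)
    print("n=%d: AUC ranged from %.3f to %.3f" % (size, lo, hi))
```

In a sketch like this, the spread of AUC values across small subsamples is typically much wider than across large ones, which is the kind of behavior the validation procedure is designed to expose.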
Differences in performance between the models were small. The AUC and Brier score showed large variation with small samples. Standard errors of AUC values were accurate, but the power to detect differences in performance was low. Calibration tests were extremely sensitive to sample size. Direct comparison of performance, without statistical analysis, was unreliable with either measure.
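The sensitivity of calibration tests to sample size can be sketched as follows. The abstract does not name the two calibration measures, so the Hosmer-Lemeshow statistic over ten risk groups is used here purely as an assumed, commonly used example. The data are synthetic: predictions are shifted by a fixed amount on the logit scale (the `shift` parameter, a hypothetical choice) so that the same modest miscalibration is present at every sample size.

```python
import math
import random


def hosmer_lemeshow(y_true, y_prob, groups=10):
    """Hosmer-Lemeshow statistic over equal-sized groups of sorted predictions."""
    order = sorted(range(len(y_prob)), key=lambda i: y_prob[i])
    n = len(order)
    stat = 0.0
    for g in range(groups):
        idx = order[g * n // groups:(g + 1) * n // groups]
        size = len(idx)
        observed = sum(y_true[i] for i in idx)
        expected = sum(y_prob[i] for i in idx)
        mean_p = expected / size
        stat += (observed - expected) ** 2 / (size * mean_p * (1.0 - mean_p))
    return stat


def simulate_miscalibrated(n, rng, shift=0.2):
    """Outcomes follow the true risk; predictions are shifted on the logit scale."""
    outcomes, predicted = [], []
    for _ in range(n):
        true_p = min(max(rng.betavariate(1, 4), 1e-6), 1.0 - 1e-6)
        logit = math.log(true_p / (1.0 - true_p)) + shift
        predicted.append(1.0 / (1.0 + math.exp(-logit)))
        outcomes.append(1 if rng.random() < true_p else 0)
    return outcomes, predicted


rng = random.Random(0)
for n in (500, 40000):
    y, p = simulate_miscalibrated(n, rng)
    # 15.51 is the 5% critical value of a chi-square distribution with 8 df,
    # the reference conventionally used with 10 groups.
    print("n=%d: H-L statistic = %.1f (critical value 15.51)" % (n, hosmer_lemeshow(y, p)))
```

Because the miscalibration is identical at both sample sizes, any difference in the statistic reflects sample size alone: with tens of thousands of admissions, even a small systematic shift tends to produce a statistic far beyond the critical value.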
Substantial sample sizes are required for performance assessment and model comparison in external validation. Calibration statistics and significance tests should not be used in these settings. Instead, a simple customization method to repair lack-of-fit problems is recommended.
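The abstract does not specify which customization method is meant. As one plausible illustration only, the sketch below uses logistic recalibration: the original predicted probabilities are converted to logits, and a new intercept and slope are fitted by Newton-Raphson. The function names, the synthetic data, and the size of the simulated overprediction are all assumptions, not details from the study.

```python
import math
import random


def logit(p):
    return math.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_recalibration(y_true, y_prob, iterations=25):
    """Fit outcome ~ a + b * logit(original prediction) by Newton-Raphson."""
    x = [logit(p) for p in y_prob]
    a, b = 0.0, 1.0  # start from the unmodified model
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y_true):
            mu = expit(a + b * xi)
            w = mu * (1.0 - mu)
            g0 += yi - mu            # gradient with respect to a
            g1 += (yi - mu) * xi     # gradient with respect to b
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def recalibrate(y_prob, a, b):
    """Apply the fitted intercept and slope to the original predictions."""
    return [expit(a + b * logit(p)) for p in y_prob]


def simulate_overpredicting(n, rng, shift=0.5):
    """Synthetic model whose predictions are too high by `shift` on the logit scale."""
    outcomes, predicted = [], []
    for _ in range(n):
        true_p = min(max(rng.betavariate(1, 4), 1e-6), 1.0 - 1e-6)
        predicted.append(expit(logit(true_p) + shift))
        outcomes.append(1 if rng.random() < true_p else 0)
    return outcomes, predicted


rng = random.Random(0)
y, p = simulate_overpredicting(20000, rng)
a, b = fit_recalibration(y, p)
p_new = recalibrate(p, a, b)
print("intercept=%.3f slope=%.3f" % (a, b))
print("observed rate=%.4f, mean prediction before=%.4f, after=%.4f"
      % (sum(y) / len(y), sum(p) / len(p), sum(p_new) / len(p_new)))
```

Recalibration of this kind changes only two parameters, so it leaves the ranking of patients, and therefore the AUC, unchanged whenever the fitted slope is positive; it repairs calibration without re-estimating the model's individual predictor weights.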