Vergouwe Yvonne, Steyerberg Ewout W, Eijkemans Marinus J C, Habbema J Dik F
Department of Public Health, Erasmus MC, P.O. Box 1738, 3000 DR Rotterdam, The Netherlands.
J Clin Epidemiol. 2005 May;58(5):475-83. doi: 10.1016/j.jclinepi.2004.06.017.
The performance of a prediction model is usually worse in external validation data than in the development data. We aimed to determine at which effective sample sizes (i.e., numbers of events) relevant differences in model performance can be detected with adequate power.
We used a logistic regression model to predict the probability that residual masses of patients treated for metastatic testicular cancer contained only benign tissue. We performed standard power calculations and Monte Carlo simulations to estimate the numbers of events that are required to detect several types of model invalidity with 80% power at the 5% significance level.
A validation sample with 111 events was required to detect that a model predicted probabilities that were too high, when predictions were on average 1.5 times too high on the odds scale. Detecting a decrease in the discriminative ability of the model, indicated by a decrease in the c-statistic from 0.83 to 0.73, required 81 to 106 events, depending on the specific scenario.
We suggest a minimum of 100 events and 100 nonevents for external validation samples. Specific hypotheses may, however, require substantially higher effective sample sizes to obtain adequate power.
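The Monte Carlo approach described above can be illustrated with a minimal sketch for one type of invalidity: predictions that are systematically too high on the odds scale. The sketch is not the authors' code. It assumes a hypothetical normal distribution for the linear predictor (`lp_mean`, `lp_sd`), and it uses a score test of calibration-in-the-large, which may differ from the test used in the study. The function name `simulate_power` and all default values are illustrative, so the resulting power estimates should not be expected to reproduce the event counts reported in the abstract.

```python
import math
import random


def simulate_power(n_patients, odds_ratio=1.5, lp_mean=0.0, lp_sd=1.5,
                   n_sim=500, z_crit=1.96, seed=1):
    """Estimate power to detect predictions that are too high.

    Assumptions (not from the source): the linear predictor is normally
    distributed, and miscalibration is tested with a score test of
    H0: intercept shift = 0 (calibration-in-the-large).
    Returns (estimated power, mean number of events per simulated sample).
    """
    rng = random.Random(seed)
    shift = math.log(odds_ratio)  # predicted odds are odds_ratio times the true odds
    rejections = 0
    total_events = 0
    for _ in range(n_sim):
        score = 0.0  # sum of (observed outcome - predicted probability)
        info = 0.0   # sum of p * (1 - p), the variance of the score under H0
        for _ in range(n_patients):
            lp = rng.gauss(lp_mean, lp_sd)
            p_pred = 1.0 / (1.0 + math.exp(-lp))
            p_true = 1.0 / (1.0 + math.exp(-(lp - shift)))
            y = 1 if rng.random() < p_true else 0
            total_events += y
            score += y - p_pred
            info += p_pred * (1.0 - p_pred)
        z = score / math.sqrt(info)
        if abs(z) > z_crit:  # two-sided test at roughly the 5% level
            rejections += 1
    return rejections / n_sim, total_events / n_sim


if __name__ == "__main__":
    for n in (100, 200, 400):
        power, events = simulate_power(n)
        print(f"n={n}: mean events={events:.1f}, estimated power={power:.2f}")
```

Running the script over a grid of sample sizes shows how the estimated power rises with the number of events. The smallest sample reaching 80% power under the assumed distribution is the simulated analogue of the required effective sample size.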