Eertink Jakoba J, Heymans Martijn W, Zwezerijnen Gerben J C, Zijlstra Josée M, de Vet Henrica C W, Boellaard Ronald
Department of Hematology, Amsterdam UMC Location Vrije Universiteit Amsterdam, De Boelelaan 1117, 1081 HV, Amsterdam, The Netherlands.
Imaging and Biomarkers, Cancer Center Amsterdam, Amsterdam, The Netherlands.
EJNMMI Res. 2022 Sep 11;12(1):58. doi: 10.1186/s13550-022-00931-w.
Clinical prediction models need to be validated. In this study, we used simulated data to compare various internal and external validation approaches.
Data for 500 patients were simulated using the distributions of metabolic tumor volume, standardized uptake value, the maximal distance between the largest lesion and another lesion, WHO performance status and age observed in 296 diffuse large B cell lymphoma patients. These data were used to predict progression after 2 years based on an existing logistic regression model. Using the simulated data, we applied cross-validation, bootstrapping and a holdout set (n = 100). For external validation, we simulated new external datasets (n = 100, n = 200, n = 500) and, in addition, (1) simulated stage-specific external datasets, (2) varied the cut-off for high-risk patients, (3) varied the false positive and false negative rates, and (4) simulated a dataset with EARL2 characteristics. All internal and external simulations were repeated 100 times. Model performance was expressed as the cross-validated area under the curve (CV-AUC ± SD) and the calibration slope.
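The following Python sketch illustrates how such a comparison of internal validation approaches could be set up. The distribution parameters and model coefficients below are illustrative assumptions, not the study's actual values, which were derived from the 296 observed patients.

```python
# A minimal sketch comparing internal validation approaches on simulated data.
# All distribution parameters and coefficients are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)
n = 500

# Simulated predictors: MTV, SUV and Dmax (log-normal), WHO status, age.
X = np.column_stack([
    rng.lognormal(5.0, 1.2, n),   # metabolic tumor volume
    rng.lognormal(2.5, 0.5, n),   # standardized uptake value
    rng.lognormal(3.0, 1.0, n),   # max distance largest lesion to another lesion
    rng.integers(0, 3, n),        # WHO performance status
    rng.normal(62.0, 12.0, n),    # age
])

# Outcome (progression within 2 years) generated from an assumed logistic model.
beta = np.array([0.004, 0.02, 0.01, 0.5, 0.02])
p_true = 1.0 / (1.0 + np.exp(-(X @ beta - 3.5)))
y = rng.binomial(1, p_true)

model = LogisticRegression(max_iter=1000)

# Cross-validation on the full dataset.
cv_auc = cross_val_score(model, X, y, scoring="roc_auc",
                         cv=StratifiedKFold(5, shuffle=True, random_state=0))

# Holdout: a single split with a test set of n = 100.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=100,
                                          stratify=y, random_state=0)
holdout_auc = roc_auc_score(y_te, model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])

# Bootstrapping: refit on resamples, evaluate on the original data.
boot_auc = [roc_auc_score(y, model.fit(*resample(X, y)).predict_proba(X)[:, 1])
            for _ in range(100)]

print(f"CV-AUC {cv_auc.mean():.2f} +/- {cv_auc.std():.2f} | "
      f"holdout AUC {holdout_auc:.2f} | "
      f"bootstrap AUC {np.mean(boot_auc):.2f} +/- {np.std(boot_auc):.2f}")
```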
Cross-validation (0.71 ± 0.06) and holdout (0.70 ± 0.07) resulted in comparable model performance, but the performance estimate was more uncertain with the holdout set. Bootstrapping resulted in a CV-AUC of 0.67 ± 0.02. The calibration slope was comparable across these internal validation approaches. Increasing the size of the test set resulted in more precise CV-AUC estimates and a smaller SD for the calibration slope. For test datasets restricted to specific stages, the CV-AUC increased with higher Ann Arbor stage. As expected, changing the cut-off for high-risk patients and the false positive and false negative rates influenced model performance, which is clearly shown by the low calibration slope. The EARL2 dataset resulted in similar model performance and precision, but the calibration slope indicated overfitting.
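The calibration slope reported above can be obtained by regressing the observed outcomes on the log-odds of the predicted probabilities: a slope near 1 indicates good calibration, while a slope well below 1 signals overfitting (predictions that are too extreme). A minimal sketch, assuming predicted probabilities p_hat and test outcomes y_test from a fitted model:

```python
# A minimal sketch of the calibration slope: logistic regression of the
# observed outcome on the linear predictor (logit of the predicted risk).
# p_hat and y_test are assumed to come from a fitted model and a test set.
import numpy as np
import statsmodels.api as sm

def calibration_slope(y_test, p_hat, eps=1e-8):
    p = np.clip(p_hat, eps, 1.0 - eps)  # guard against probabilities of 0 or 1
    lp = np.log(p / (1.0 - p))          # linear predictor on the logit scale
    fit = sm.Logit(y_test, sm.add_constant(lp)).fit(disp=0)
    return fit.params[1]                # slope; ~1 means well calibrated
```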
For small datasets, it is not advisable to use a holdout set or a very small external dataset with similar characteristics, because a single small test dataset suffers from large uncertainty. Repeated cross-validation using the full training dataset is therefore preferred. Our simulations also demonstrated that it is important to consider the impact of differences in patient population between training and test data, which may call for adjustment or stratification on relevant variables.
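As a sketch of the recommended alternative for small datasets, repeated cross-validation on the full dataset (rather than a single small holdout split) could be run as follows; X and y are assumed to be the simulated predictors and outcomes from the sketch above:

```python
# A minimal sketch of repeated cross-validation on the full (small) dataset,
# the approach preferred here over a single small holdout set. X and y are
# assumed to come from the simulation sketch above.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=100, random_state=0)
aucs = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                       cv=cv, scoring="roc_auc")
print(f"repeated CV-AUC {aucs.mean():.2f} +/- {aucs.std():.2f}")
```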