Institute of Diagnostic and Interventional Radiology and Neuroradiology, University Hospital Essen, Hufelandstraße 55, 45147, Essen, Germany.
Sci Rep. 2024 May 21;14(1):11563. doi: 10.1038/s41598-024-62585-z.
Class imbalance is often unavoidable in radiomic data collected from clinical routine. It can create problems during classifier training, since the majority class may dominate the minority class. Consequently, resampling methods such as oversampling or undersampling are applied to balance the classes. However, the resampling must not be applied upfront to all data, because doing so leads to data leakage and, therefore, to erroneous results. This study aims to measure the extent of this bias. Five-fold cross-validation with 30 repeats was performed on a set of 15 radiomic datasets to train predictive models. The training involved two scenarios: first, the models were trained correctly, with the resampling methods applied within the cross-validation to the training folds only; second, the models were trained incorrectly, with the resampling applied to all data before cross-validation. The bias was defined empirically as the difference between the best-performing models in the two scenarios in terms of area under the receiver operating characteristic curve (AUC), sensitivity, specificity, balanced accuracy, and the Brier score. In addition, a simulation study was performed on a randomly generated dataset for verification. The results demonstrated that incorrectly applying the oversampling methods to all data resulted in a large positive bias (up to 0.34 in AUC, 0.33 in sensitivity, 0.31 in specificity, and 0.37 in balanced accuracy). The bias depended on the class balance: an increase of approximately 0.10 in AUC was observed for each increase in imbalance. The models also showed biased calibration, as measured by the Brier score, which differed by up to -0.18 between the correctly and incorrectly trained models. The undersampling methods were not significantly affected by this bias. These results emphasize that any resampling method should be applied to the training data only, to avoid data leakage and, subsequently, biased model performance and calibration.
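To make the two training scenarios concrete, the sketch below contrasts them using scikit-learn and imbalanced-learn. It is a minimal illustration, not the study's actual pipeline: the synthetic dataset, the logistic regression classifier, and the SMOTE settings are assumptions chosen for brevity, while the 5-fold cross-validation with 30 repeats mirrors the protocol described above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Synthetic imbalanced dataset standing in for a radiomic dataset
# (~10% minority class).
X, y = make_classification(n_samples=500, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

# 5-fold cross-validation with 30 repeats, as in the study.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=30, random_state=0)

# Correct scenario: SMOTE sits inside the pipeline, so it is refit on
# each training fold only; the validation folds remain untouched.
correct = Pipeline([("smote", SMOTE(random_state=0)),
                    ("clf", LogisticRegression(max_iter=1000))])
auc_correct = cross_val_score(correct, X, y, cv=cv, scoring="roc_auc")

# Incorrect scenario: oversampling the full dataset before splitting
# leaks synthetic copies of minority samples into the validation folds.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
leaky = LogisticRegression(max_iter=1000)
auc_leaky = cross_val_score(leaky, X_res, y_res, cv=cv, scoring="roc_auc")

print(f"correct CV AUC: {auc_correct.mean():.3f}")
print(f"leaky CV AUC:   {auc_leaky.mean():.3f}  (optimistically biased)")
```

On data like this, the leaky estimate typically comes out noticeably higher than the correct one, which is the positive bias the abstract quantifies; the exact gap depends on the dataset, classifier, and degree of imbalance.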