Julius Center for Health Sciences and Primary Care, UMC Utrecht, Utrecht University, Utrecht, The Netherlands.
Department of Development and Regeneration, KU Leuven, Leuven, Belgium.
J Am Med Inform Assoc. 2022 Aug 16;29(9):1525-1534. doi: 10.1093/jamia/ocac093.
Methods to correct class imbalance (imbalance between the frequency of outcome events and nonevents) are receiving increasing interest for developing prediction models. We examined the effect of imbalance correction on the performance of logistic regression models.
Prediction models were developed using standard and penalized (ridge) logistic regression under 4 methods to address class imbalance: no correction, random undersampling, random oversampling, and SMOTE. Model performance was evaluated in terms of discrimination, calibration, and classification. Using Monte Carlo simulations, we studied the impact of training set size, number of predictors, and the outcome event fraction. A case study on prediction modeling for ovarian cancer diagnosis is presented.
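As an illustration of two of the resampling strategies named above, the following is a minimal sketch of random undersampling and random oversampling using only the Python standard library. The data, the helper names (`random_undersample`, `random_oversample`, `event_fraction`), and the 10% event fraction are hypothetical and are not taken from the study. SMOTE, which synthesizes new minority-class records by interpolation, is not shown.

```python
import random


def event_fraction(rows):
    """Share of records whose outcome label is 1."""
    return sum(y for _, y in rows) / len(rows)


def split_by_size(rows):
    """Return (minority, majority) lists of (x, y) records."""
    events = [r for r in rows if r[1] == 1]
    nonevents = [r for r in rows if r[1] == 0]
    return sorted([events, nonevents], key=len)


def random_undersample(rows, rng):
    """Discard random majority-class records until the classes are equal in size."""
    minority, majority = split_by_size(rows)
    return minority + rng.sample(majority, len(minority))


def random_oversample(rows, rng):
    """Duplicate random minority-class records until the classes are equal in size."""
    minority, majority = split_by_size(rows)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra


rng = random.Random(0)

# Hypothetical training set: one predictor x, outcome y with roughly 10% events.
rows = []
for _ in range(2000):
    y = 1 if rng.random() < 0.10 else 0
    x = rng.gauss(1.0 if y else 0.0, 1.0)
    rows.append((x, y))

under = random_undersample(rows, rng)
over = random_oversample(rows, rng)

print("original    :", len(rows), "records, event fraction", round(event_fraction(rows), 3))
print("undersampled:", len(under), "records, event fraction", event_fraction(under))
print("oversampled :", len(over), "records, event fraction", event_fraction(over))
```

Both procedures force the event fraction in the training data to 0.5, so a model fitted to the resampled data no longer sees the outcome frequency of the population it will be applied to.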
The use of random undersampling, random oversampling, or SMOTE yielded poorly calibrated models: the probability of belonging to the minority class was strongly overestimated. These methods did not result in higher areas under the ROC curve when compared with models developed without correction for class imbalance. Although imbalance correction improved the balance between sensitivity and specificity, similar results were obtained by shifting the probability threshold instead.
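The link between overestimated probabilities and threshold shifting can be sketched with an idealized calculation. The sketch assumes that balancing the classes to 50/50 only shifts the model intercept, so that predicted odds are multiplied by (1 - event fraction) / event fraction. That is a simplification that holds for a correctly specified logistic model in large samples, not a reproduction of the simulations in the study. The risks and the function name `shift_to_balanced` are hypothetical.

```python
import math
import random


def shift_to_balanced(p, event_fraction):
    """Probability reported after 50/50 resampling, ASSUMING resampling only
    shifts the intercept (odds multiplied by (1 - pi) / pi)."""
    k = (1 - event_fraction) / event_fraction
    odds = p / (1 - p) * k
    return odds / (1 + odds)


rng = random.Random(1)

# Hypothetical well-calibrated risks from an uncorrected model:
# inverse logit of a normally distributed linear predictor.
risks = [1 / (1 + math.exp(-rng.gauss(-2.5, 1.0))) for _ in range(5000)]
pi = sum(risks) / len(risks)  # event fraction implied by these risks

shifted = [shift_to_balanced(p, pi) for p in risks]
mean_shifted = sum(shifted) / len(shifted)

# Classify with the "corrected" model at 0.5, and with the uncorrected
# model at a threshold equal to the event fraction.
labels_corrected = [q >= 0.5 for q in shifted]
labels_threshold = [p >= pi for p in risks]
agreement = sum(a == b for a, b in zip(labels_corrected, labels_threshold))

print("mean uncorrected risk :", round(pi, 3))
print("mean 'corrected' risk :", round(mean_shifted, 3))
print("identical classifications:", agreement, "of", len(risks))
```

Under this assumption the transformation is monotone, so the ranking of patients (and hence the area under the ROC curve) is unchanged, the average predicted probability is pushed well above the event fraction, and the classifications at a 0.5 threshold match those obtained by moving the threshold of the uncorrected model to the event fraction.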
Imbalance correction led to models with strong miscalibration without better ability to distinguish between patients with and without the outcome event. The inaccurate probability estimates reduce the clinical utility of the model, because decisions about treatment are ill-informed.
Outcome imbalance is not a problem in itself; imbalance correction may even worsen model performance.