Lili Zhang, Herman Ray, Jennifer Priestley, Soon Tan
Analytics and Data Science Ph.D. Program, Kennesaw State University, Kennesaw, Georgia, USA.
Analytics and Data Science Institute, Kennesaw State University, Kennesaw, Georgia, USA.
J Appl Stat. 2019 Jul 23;47(3):568-581. doi: 10.1080/02664763.2019.1643829. eCollection 2020.
Training classification models on imbalanced data tends to bias them towards the majority class. In this paper, we demonstrate how variable discretization and cost-sensitive logistic regression help mitigate this bias on an imbalanced credit scoring dataset, and further apply the variable discretization technique to data from other domains, demonstrating its potential as a generic technique for classifying imbalanced data beyond credit scoring. The performance measurements include ROC curves, area under the ROC curve (AUC), Type I error, Type II error, accuracy, and F1 score. The results show that proper variable discretization and cost-sensitive logistic regression with the best class weights can reduce the model bias and/or variance. From the perspective of the algorithm, cost-sensitive logistic regression is beneficial because it increases the value of predictors even when they are not in their optimized forms, while maintaining monotonicity. From the perspective of the predictors, variable discretization performs better than cost-sensitive logistic regression: it provides more reasonable coefficient estimates for predictors that have nonlinear relationships with their empirical logits, and it is robust to the penalty weights on misclassifications of events and non-events determined by their a priori proportions.
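The two techniques summarized above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the synthetic data, the quartile bin edges, and the learning rate are all assumptions made for the example. It discretizes a continuous predictor into quantile bins (encoded as dummy variables) and fits a logistic regression whose per-observation weights penalize misclassified events in inverse proportion to their a priori share, as the abstract describes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced dataset (~5% events), standing in for a
# credit-scoring sample: one continuous predictor x, binary outcome y.
n = 2000
x = rng.normal(size=n)
y = (rng.random(n) < 1 / (1 + np.exp(-(x - 3)))).astype(float)

# Variable discretization: cut x at its quartiles and replace the raw
# value with dummy columns for the resulting bins.
edges = np.quantile(x, [0.25, 0.5, 0.75])
bins = np.digitize(x, edges)                   # 4 ordinal bins: 0..3
X = np.eye(4)[bins]                            # one-hot bin indicators
X = np.hstack([np.ones((n, 1)), X[:, 1:]])     # intercept + 3 dummies

# Cost-sensitive class weights from the a priori event proportion:
# rarer class -> larger penalty on its misclassification.
p1 = y.mean()
w = np.where(y == 1, 0.5 / p1, 0.5 / (1 - p1))

# Weighted logistic regression fit by plain gradient descent.
beta = np.zeros(X.shape[1])
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ beta))            # predicted event probability
    grad = X.T @ (w * (p - y)) / n             # weighted log-loss gradient
    beta -= 0.5 * grad
```

Because each bin gets its own coefficient, the discretized model can follow a nonlinear relationship between the predictor and the empirical logit, which is the advantage the abstract attributes to variable discretization.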