Zhang Lili, Geisler Trent, Ray Herman, Xie Ying
Analytics and Data Science Ph.D. Program, Kennesaw State University, Kennesaw, GA, USA.
Analytics and Data Science Institute, Kennesaw State University, Kennesaw, GA, USA.
J Appl Stat. 2021 Jun 16;49(13):3257-3277. doi: 10.1080/02664763.2021.1939662. eCollection 2022.
Logistic regression is estimated by maximizing a log-likelihood objective function formulated under the assumption that overall accuracy is to be maximized. This assumption does not hold for imbalanced data. The resulting models tend to be biased towards the majority class (i.e. non-event), which can cause great loss in practice. One strategy for mitigating such bias is to penalize the misclassification costs of observations differently in the log-likelihood function. Existing solutions require either difficult hyperparameter estimation or high computational complexity. We propose a novel penalized log-likelihood function that includes penalty weights as decision variables for observations in the minority class (i.e. event) and learns them from the data along with the model coefficients. In the experiments, the proposed logistic regression model is compared with existing ones on the area under the receiver operating characteristic (ROC) curve across 10 public datasets and 16 simulated datasets, as well as on training time. A detailed analysis is conducted on an imbalanced credit dataset to examine the estimated probability distributions, additional performance measures (i.e. type I error and type II error), and model coefficients. The results demonstrate that both the discrimination ability and the computational efficiency of logistic regression models are improved by using the proposed log-likelihood function as the learning objective.
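To illustrate the general idea of learning per-observation penalty weights jointly with model coefficients, the following is a minimal sketch, not the paper's actual formulation (which is not given in the abstract). It assumes a penalized log-likelihood where each minority-class observation gets a learnable weight w_i, optimized by gradient ascent together with the coefficients; the quadratic regularizer keeping the weights near 1 is a hypothetical choice for this sketch.

```python
import numpy as np

def sigmoid(z):
    # numerically safe logistic function
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def fit_weighted_logreg(X, y, lr=0.1, n_iter=2000, w_reg=0.1):
    """Jointly learn coefficients beta and per-observation penalty
    weights w_i for minority-class (y = 1) observations by gradient
    ascent on a penalized log-likelihood of the form
        L = sum_{y_i=0} log(1 - p_i) + sum_{y_i=1} w_i * log(p_i)
            - w_reg * sum_i (w_i - 1)^2
    The quadratic term (a hypothetical regularizer, not from the
    paper) keeps the learned weights from collapsing to zero."""
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])   # add intercept column
    beta = np.zeros(d + 1)
    minority = np.where(y == 1)[0]
    w = np.ones(len(minority))             # penalty weights, one per event
    for _ in range(n_iter):
        p = sigmoid(Xb @ beta)
        # effective sample weights: w_i for minority, 1 for majority
        sw = np.ones(n)
        sw[minority] = w
        grad_beta = Xb.T @ (sw * (y - p)) / n
        grad_w = (np.log(np.clip(p[minority], 1e-12, 1.0))
                  - 2.0 * w_reg * (w - 1.0))
        beta += lr * grad_beta
        w += lr * grad_w
        w = np.clip(w, 0.0, None)          # keep weights non-negative
    return beta, w
```

Because the weights enter the objective as decision variables rather than fixed hyperparameters, no separate grid search over class weights is needed; a single gradient-based fit updates beta and w together, which is the computational advantage the abstract alludes to.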