A descriptive study of variable discretization and cost-sensitive logistic regression on imbalanced credit data.

Author information

Zhang Lili, Ray Herman, Priestley Jennifer, Tan Soon

Affiliations

Analytics and Data Science Ph.D. Program, Kennesaw State University, Kennesaw, Georgia, USA.

Analytics and Data Science Institute, Kennesaw State University, Kennesaw, Georgia, USA.

Publication information

J Appl Stat. 2019 Jul 23;47(3):568-581. doi: 10.1080/02664763.2019.1643829. eCollection 2020.

DOI:10.1080/02664763.2019.1643829
PMID:35706966
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9041569/
Abstract

Training classification models on imbalanced data tends to bias them towards the majority class. In this paper, we demonstrate how variable discretization and cost-sensitive logistic regression help mitigate this bias on an imbalanced credit scoring dataset. We further apply the variable discretization technique to data from other domains, demonstrating its potential as a generic technique for classifying imbalanced data beyond credit scoring. The performance measurements include ROC curves, area under the ROC curve (AUC), Type I error, Type II error, accuracy, and F1 score. The results show that proper variable discretization and cost-sensitive logistic regression with the best class weights can reduce the model bias and/or variance. From the perspective of the algorithm, cost-sensitive logistic regression is beneficial for increasing the value of predictors even if they are not in their optimized forms, while maintaining monotonicity. From the perspective of predictors, variable discretization performs better than cost-sensitive logistic regression, provides more reasonable coefficient estimates for predictors that have nonlinear relationships with their empirical logit, and is robust to penalty weights on misclassifications of events and non-events determined by their a priori proportions.
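As a rough illustration of the two techniques named in the abstract, the following sketch fits a class-weighted (cost-sensitive) logistic regression on a raw predictor and on the same predictor after discretization. Everything in it is an assumption made for illustration and is not taken from the paper: the synthetic data with a U-shaped logit, the choice of five equal-frequency bins, the event weight set to the inverse class ratio, the plain gradient-descent fit, and all function names. Metrics are computed in-sample only.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def quantile_cuts(values, n_bins):
    """Cut points giving roughly equal-frequency bins."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // n_bins] for k in range(1, n_bins)]

def one_hot_bin(x, cuts):
    """Indicator vector for the bin that x falls into."""
    idx = sum(x >= c for c in cuts)
    return [1.0 if j == idx else 0.0 for j in range(len(cuts) + 1)]

def fit_logistic(X, y, w_event=1.0, w_nonevent=1.0, lr=0.5, epochs=150):
    """Batch gradient descent on the class-weighted log-loss."""
    beta, intercept = [0.0] * len(X[0]), 0.0
    weights = [w_event if t == 1 else w_nonevent for t in y]
    total = sum(weights)
    for _ in range(epochs):
        grad, grad0 = [0.0] * len(beta), 0.0
        for row, t, w in zip(X, y, weights):
            z = intercept + sum(b * v for b, v in zip(beta, row))
            err = w * (sigmoid(z) - t)  # misclassification cost scales the residual
            grad0 += err
            for j, v in enumerate(row):
                grad[j] += err * v
        intercept -= lr * grad0 / total
        beta = [b - lr * g / total for b, g in zip(beta, grad)]
    return intercept, beta

def predict(model, X):
    intercept, beta = model
    return [sigmoid(intercept + sum(b * v for b, v in zip(beta, row))) for row in X]

def auc(scores, y):
    """Pairwise AUC; ties between an event and a non-event count as 0.5."""
    pos = [s for s, t in zip(scores, y) if t == 1]
    neg = [s for s, t in zip(scores, y) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def error_rates(scores, y, threshold=0.5):
    """Return (Type I, Type II), taken here as (false positive rate, false negative rate)."""
    fp = sum(s >= threshold and t == 0 for s, t in zip(scores, y))
    fn = sum(s < threshold and t == 1 for s, t in zip(scores, y))
    return fp / y.count(0), fn / y.count(1)

# Hypothetical data: events are rare and concentrated at both extremes of x,
# so the logit is nonlinear (U-shaped) in the raw predictor.
random.seed(0)
xs = [random.uniform(-2.0, 2.0) for _ in range(600)]
ys = [1 if random.random() < sigmoid(-5.0 + 1.5 * x * x) else 0 for x in xs]

X_raw = [[x] for x in xs]
cuts = quantile_cuts(xs, 5)
X_bin = [one_hot_bin(x, cuts) for x in xs]
w_event = ys.count(0) / ys.count(1)  # assumed weighting: inverse class ratio

raw_weighted = predict(fit_logistic(X_raw, ys, w_event=w_event), X_raw)
bin_plain = predict(fit_logistic(X_bin, ys), X_bin)
bin_weighted = predict(fit_logistic(X_bin, ys, w_event=w_event), X_bin)

auc_raw, auc_binned = auc(raw_weighted, ys), auc(bin_weighted, ys)
type1_plain, type2_plain = error_rates(bin_plain, ys)
type1_weighted, type2_weighted = error_rates(bin_weighted, ys)
print(f"AUC raw={auc_raw:.3f} binned={auc_binned:.3f}")
print(f"Type II error plain={type2_plain:.3f} weighted={type2_weighted:.3f}")
```

In this toy setting, the discretized predictor lets a linear model capture the U-shaped relationship that a single slope on the raw value cannot, and up-weighting events moves the 0.5 threshold's decisions away from the majority class. The sketch says nothing about which bin count or class weights are best on real credit data.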

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5c42/9041569/6434b83aa7e1/CJAS_A_1643829_F0001_OC.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5c42/9041569/56ffc1809cdd/CJAS_A_1643829_F0002_OC.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5c42/9041569/6f86dca2257c/CJAS_A_1643829_F0003_OC.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5c42/9041569/305224594b4d/CJAS_A_1643829_F0004_OC.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5c42/9041569/0d3f7af0484e/CJAS_A_1643829_F0005_OC.jpg

Similar articles

1
A descriptive study of variable discretization and cost-sensitive logistic regression on imbalanced credit data.
J Appl Stat. 2019 Jul 23;47(3):568-581. doi: 10.1080/02664763.2019.1643829. eCollection 2020.
2
Improving logistic regression on the imbalanced data by a novel penalized log-likelihood function.
J Appl Stat. 2021 Jun 16;49(13):3257-3277. doi: 10.1080/02664763.2021.1939662. eCollection 2022.
3
Combination of unsupervised discretization methods for credit risk.
PLoS One. 2023 Nov 27;18(11):e0289130. doi: 10.1371/journal.pone.0289130. eCollection 2023.
4
Robust cost-sensitive kernel method with Blinex loss and its applications in credit risk evaluation.
Neural Netw. 2021 Nov;143:327-344. doi: 10.1016/j.neunet.2021.06.016. Epub 2021 Jun 16.
5
Using a national surgical database to predict complications following posterior lumbar surgery and comparing the area under the curve and F1-score for the assessment of prognostic capability.
Spine J. 2021 Jul;21(7):1135-1142. doi: 10.1016/j.spinee.2021.02.007. Epub 2021 Feb 16.
6
Learning to improve medical decision making from imbalanced data without a priori cost.
BMC Med Inform Decis Mak. 2014 Dec 5;14:111. doi: 10.1186/s12911-014-0111-9.
7
Stable variable ranking and selection in regularized logistic regression for severely imbalanced big binary data.
PLoS One. 2023 Jan 17;18(1):e0280258. doi: 10.1371/journal.pone.0280258. eCollection 2023.
8
Class prediction for high-dimensional class-imbalanced data.
BMC Bioinformatics. 2010 Oct 20;11:523. doi: 10.1186/1471-2105-11-523.
9
Resampling methods improve the predictive power of modeling in class-imbalanced datasets.
Int J Environ Res Public Health. 2014 Sep 18;11(9):9776-89. doi: 10.3390/ijerph110909776.
10
Novel cost-sensitive approach to improve the multilayer perceptron performance on imbalanced data.
IEEE Trans Neural Netw Learn Syst. 2013 Jun;24(6):888-99. doi: 10.1109/TNNLS.2013.2246188.

Cited by

1
Improving logistic regression on the imbalanced data by a novel penalized log-likelihood function.
J Appl Stat. 2021 Jun 16;49(13):3257-3277. doi: 10.1080/02664763.2021.1939662. eCollection 2022.
2
Differential Replication for Credit Scoring in Regulated Environments.
Entropy (Basel). 2021 Mar 30;23(4):407. doi: 10.3390/e23040407.
