Improving logistic regression on the imbalanced data by a novel penalized log-likelihood function.

Authors

Zhang Lili, Geisler Trent, Ray Herman, Xie Ying

Affiliations

Analytics and Data Science Ph.D. Program, Kennesaw State University, Kennesaw, GA, USA.

Analytics and Data Science Institute, Kennesaw State University, Kennesaw, GA, USA.

Publication details

J Appl Stat. 2021 Jun 16;49(13):3257-3277. doi: 10.1080/02664763.2021.1939662. eCollection 2022.

DOI:10.1080/02664763.2021.1939662
PMID:36213775
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9542776/

Abstract

Logistic regression is estimated by maximizing the log-likelihood objective function formulated under the assumption of maximizing the overall accuracy. That does not apply to the imbalanced data. The resulting models tend to be biased towards the majority class (i.e. non-event), which can bring great loss in practice. One strategy for mitigating such bias is to penalize the misclassification costs of observations differently in the log-likelihood function. Existing solutions require either hard hyperparameter estimating or high computational complexity. We propose a novel penalized log-likelihood function by including penalty weights as decision variables for observations in the minority class (i.e. event) and learning them from data along with model coefficients. In the experiments, the proposed logistic regression model is compared with the existing ones on the statistics of area under receiver operating characteristics (ROC) curve from 10 public datasets and 16 simulated datasets, as well as the training time. A detailed analysis is conducted on an imbalanced credit dataset to examine the estimated probability distributions, additional performance measurements (i.e. type I error and type II error) and model coefficients. The results demonstrate that both the discrimination ability and computation efficiency of logistic regression models are improved using the proposed log-likelihood function as the learning objective.
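The abstract describes penalizing event and non-event observations differently inside the log-likelihood, but it does not give the exact form of the proposed objective. As a rough illustration of the general idea only, the sketch below fits a logistic regression with a single fixed penalty weight on the event terms, which is the conventional cost-sensitive variant. The paper instead learns the penalty weights as decision variables jointly with the coefficients; that step is not reproduced here. The names `w_event`, `fit_weighted_logistic`, and the simulated data are assumptions made for this example, not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weighted_log_likelihood(beta, X, y, w_event):
    """Log-likelihood in which each event (y = 1) term is scaled by w_event."""
    p = sigmoid(X @ beta)
    eps = 1e-12  # guards against log(0)
    return np.sum(w_event * y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def fit_weighted_logistic(X, y, w_event=1.0, lr=0.1, n_iter=2000):
    """Maximize the weighted log-likelihood with plain gradient ascent."""
    beta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        # Gradient: sum_i [w * y_i * (1 - p_i) - (1 - y_i) * p_i] * x_i
        grad = X.T @ (w_event * y * (1 - p) - (1 - y) * p)
        beta += lr * grad / n
    return beta

# Simulated imbalanced data (assumed setup: events are the rare class).
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
y = (rng.random(n) < sigmoid(-3.0 + 1.5 * x)).astype(float)
X = np.column_stack([np.ones(n), x])  # intercept column plus one predictor

beta_plain = fit_weighted_logistic(X, y, w_event=1.0)      # ordinary logistic regression
beta_weighted = fit_weighted_logistic(X, y, w_event=10.0)  # events penalized 10x (fixed, not learned)

print("event rate:", y.mean())
print("unweighted coefficients:", beta_plain)
print("weighted coefficients:  ", beta_weighted)
```

With a fixed weight above 1 on the event terms, the fitted intercept moves upward, so predicted probabilities shift toward the minority class. Choosing that weight by hand is the kind of hyperparameter estimation the abstract says the proposed method avoids.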

Similar articles

1. Improving logistic regression on the imbalanced data by a novel penalized log-likelihood function.
   J Appl Stat. 2021 Jun 16;49(13):3257-3277. doi: 10.1080/02664763.2021.1939662. eCollection 2022.
2. A descriptive study of variable discretization and cost-sensitive logistic regression on imbalanced credit data.
   J Appl Stat. 2019 Jul 23;47(3):568-581. doi: 10.1080/02664763.2019.1643829. eCollection 2020.
3. Penalized Logistic Regression Analysis for Genetic Association Studies of Binary Phenotypes.
   Hum Hered. 2022 Jun 29. doi: 10.1159/000525650.
4. On estimation for accelerated failure time models with small or rare event survival data.
   BMC Med Res Methodol. 2022 Jun 11;22(1):169. doi: 10.1186/s12874-022-01638-1.
5. Parameter-Free Loss for Class-Imbalanced Deep Learning in Image Classification.
   IEEE Trans Neural Netw Learn Syst. 2023 Jun;34(6):3234-3240. doi: 10.1109/TNNLS.2021.3110885. Epub 2023 Jun 1.
6. RSMOTE: improving classification performance over imbalanced medical datasets.
   Health Inf Sci Syst. 2020 Jun 12;8(1):22. doi: 10.1007/s13755-020-00112-w. eCollection 2020 Dec.
7. A Comparative Study of the Bias Correction Methods for Differential Item Functioning Analysis in Logistic Regression with Rare Events Data.
   Biomed Res Int. 2020 Feb 25;2020:1632350. doi: 10.1155/2020/1632350. eCollection 2020.
8. Majorization Minimization by Coordinate Descent for Concave Penalized Generalized Linear Models.
   Stat Comput. 2014 Sep;24(5):871-883. doi: 10.1007/s11222-013-9407-3.
9. Estimating haplotype effects on dichotomous outcome for unphased genotype data using a weighted penalized log-likelihood approach.
   Hum Hered. 2006;61(2):104-10. doi: 10.1159/000093476. Epub 2006 May 24.
10. An investigation of penalization and data augmentation to improve convergence of generalized estimating equations for clustered binary outcomes.
    BMC Med Res Methodol. 2022 Jun 9;22(1):168. doi: 10.1186/s12874-022-01641-6.

Cited by

1. The Role of Pre-surgery Clinical Communication on Metabolic and Bariatric Surgery Outcomes: A Prospective Study.
   Obes Surg. 2025 Apr;35(4):1223-1233. doi: 10.1007/s11695-025-07772-1. Epub 2025 Mar 13.
2. Exploring the impact of sequence context on errors in SNP genotype calling with whole genome sequencing data using AI-based autoencoder approach.
   NAR Genom Bioinform. 2024 Sep 24;6(3):lqae131. doi: 10.1093/nargab/lqae131. eCollection 2024 Sep.
3. Processing imbalanced medical data at the data level with assisted-reproduction data as an example.
   BioData Min. 2024 Sep 4;17(1):29. doi: 10.1186/s13040-024-00384-y.
4. A Comparative Analysis of Two Automated Quantification Methods for Regional Cerebral Amyloid Retention: PET-Only and PET-and-MRI-Based Methods.
   Int J Mol Sci. 2024 Jul 12;25(14):7649. doi: 10.3390/ijms25147649.
