An empirical evaluation of sampling methods for the classification of imbalanced data.

Affiliation

Department of Computer Science and Engineering, Graduate School, Soongsil University, Seoul, Korea.

Publication information

PLoS One. 2022 Jul 28;17(7):e0271260. doi: 10.1371/journal.pone.0271260. eCollection 2022.

DOI: 10.1371/journal.pone.0271260
PMID: 35901023
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9333262/
Abstract

In numerous classification problems, class distribution is not balanced. For example, positive examples are rare in the fields of disease diagnosis and credit card fraud detection. General machine learning methods are known to be suboptimal for such imbalanced classification. One popular solution is to balance training data by oversampling the underrepresented (or undersampling the overrepresented) classes before applying machine learning algorithms. However, despite its popularity, the effectiveness of sampling has not been rigorously and comprehensively evaluated. This study assessed combinations of seven sampling methods and eight machine learning classifiers (56 varieties in total) using 31 datasets with varying degrees of imbalance. We used the areas under the precision-recall curve (AUPRC) and the receiver operating characteristic curve (AUROC) as the performance measures. The AUPRC is known to be more informative for imbalanced classification than the AUROC. We observed that sampling significantly changed the performance of the classifier (paired t-tests, P < 0.05) in only a few cases (12.2% in AUPRC and 10.0% in AUROC). Surprisingly, sampling was more likely to reduce than to improve the classification performance. Moreover, the adverse effects of sampling were more pronounced in AUPRC than in AUROC. Among the sampling methods, undersampling performed worse than the others. Also, sampling was more effective for improving linear classifiers. Most importantly, we did not need sampling to obtain the optimal classifier for most of the 31 datasets. In addition, we found two interesting examples in which sampling significantly reduced AUPRC while significantly improving AUROC (paired t-tests, P < 0.05). In conclusion, the applicability of sampling is limited because it could be ineffective or even harmful. Furthermore, the choice of the performance measure is crucial for decision making. Our results provide valuable insights into the effect and characteristics of sampling for imbalanced classification.
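The pieces of the evaluation described in the abstract (resampling the training data, scoring with AUROC and AUPRC, and a paired t-test across repeated runs) can be illustrated with a minimal standard-library sketch. This is not the authors' code: the abstract does not name the seven sampling methods or the eight classifiers, so plain random over- and undersampling stand in here, and the function names `auroc`, `average_precision`, `resample`, and `paired_t` are hypothetical. AUPRC is approximated by average precision, and the toy labels and scores are invented for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev


def auroc(y_true, scores):
    """Chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Step-wise AUPRC: mean of the precision at each positive's rank.

    Assumes untied scores and at least one positive label.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    true_pos, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            total += true_pos / rank
    return total / sum(y_true)


def resample(X, y, mode, seed=0):
    """Balance a binary training set by random over- or undersampling.

    Assumption: a stand-in for the (unnamed) sampling methods in the study.
    """
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = sorted((pos, neg), key=len)
    if mode == "over":
        # Duplicate minority rows (with replacement) up to the majority size.
        minority = minority + rng.choices(minority, k=len(majority) - len(minority))
    elif mode == "under":
        # Keep a random subset of majority rows, matching the minority size.
        majority = rng.sample(majority, len(minority))
    else:
        raise ValueError("mode must be 'over' or 'under'")
    idx = minority + majority
    return [X[i] for i in idx], [y[i] for i in idx]


def paired_t(a, b):
    """Paired t statistic for per-run metric values a (e.g. sampled) vs b (baseline)."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Toy ranking: 2 positives among 10 examples, listed from highest to lowest score.
labels = [0, 1, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
print(auroc(labels, scores))              # 0.8125
print(average_precision(labels, scores))  # 0.5
```

On this toy ranking the same scores look good by AUROC (0.8125) but mediocre by AUPRC (0.5), which is the kind of disagreement that makes the choice of performance measure matter. `paired_t` returns only the t statistic; turning it into a P value requires a t-distribution table or a statistics library.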


