Conversion of adverse data corpus to shrewd output using sampling metrics.

Authors

Ashraf Shahzad, Saleem Sehrish, Ahmed Tauqeer, Aslam Zeeshan, Muhammad Durr

Affiliations

College of Internet of Things Engineering, Hohai University, Changzhou, Jiangsu, 210032, China.

Muhammad Nawaz Sharif University of Engineering & Technology, Multan, 66000, Pakistan.

Publication information

Vis Comput Ind Biomed Art. 2020 Aug 11;3(1):19. doi: 10.1186/s42492-020-00055-9.

DOI:10.1186/s42492-020-00055-9
PMID:32779031
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7417470/
Abstract

In an imbalanced dataset, at least one class typically has fewer samples than the others. A machine learning algorithm (classifier) trained with an imbalanced dataset tends to predict the frequently occurring majority class more often than the rarely occurring minority classes. Training with an imbalanced dataset poses challenges for classifiers; however, applying suitable techniques for reducing class imbalance can enhance classifiers' performance. In this study, we consider an imbalanced dataset from an educational context. Initially, we examine all shortcomings regarding the classification of an imbalanced dataset. Then, we apply data-level algorithms for class balancing and compare the performance of classifiers. The performance of the classifiers is measured using the information in their confusion matrices, namely accuracy, precision, recall, and F-measure. The results show that classification with an imbalanced dataset may produce high accuracy but low precision and recall for the minority class. The analysis confirms that both undersampling and oversampling are effective for balancing datasets, but the latter (oversampling) dominates.
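The effect described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the data-level algorithms used, so plain random oversampling and undersampling are swapped in here as the simplest examples of that family. The function names `binary_metrics` and `resample`, the labels "pass" and "fail", and the 95/5 class split are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def binary_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall and F-measure for one class of interest,
    computed from the four confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


def resample(samples, labels, mode, seed=0):
    """Randomly balance classes (illustrative, not the paper's algorithm).

    mode="over":  duplicate random samples until every class matches the largest.
    mode="under": keep a random subset so every class matches the smallest.
    """
    if mode not in ("over", "under"):
        raise ValueError("mode must be 'over' or 'under'")
    rng = random.Random(seed)
    groups = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(label, []).append(sample)
    sizes = [len(group) for group in groups.values()]
    target = max(sizes) if mode == "over" else min(sizes)
    out_samples, out_labels = [], []
    for label, group in groups.items():
        if mode == "over":
            extra = [rng.choice(group) for _ in range(target - len(group))]
            chosen = group + extra
        else:
            chosen = rng.sample(group, target)
        out_samples.extend(chosen)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Hypothetical data: 95 majority ("pass") and 5 minority ("fail") records.
labels = ["pass"] * 95 + ["fail"] * 5
samples = list(range(100))

# A degenerate classifier that always predicts the majority class.
always_pass = ["pass"] * 100
scores = binary_metrics(labels, always_pass, positive="fail")
print(scores)  # accuracy is 0.95; precision, recall and F-measure for "fail" are 0.0

_, over_labels = resample(samples, labels, "over")
_, under_labels = resample(samples, labels, "under")
print(Counter(over_labels), Counter(under_labels))
```

The always-"pass" predictor shows why accuracy alone is misleading on imbalanced data: it is right on 95 of 100 records yet never identifies a minority-class record. In practice, resampling of this kind is usually applied to the training split only, so that evaluation still reflects the original class distribution.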

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a35f/7417470/c0fabc27667c/42492_2020_55_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a35f/7417470/ac85a402d720/42492_2020_55_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a35f/7417470/e95e08891edf/42492_2020_55_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a35f/7417470/6bdd9bb80813/42492_2020_55_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a35f/7417470/b622312f56ac/42492_2020_55_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a35f/7417470/820d527c994b/42492_2020_55_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a35f/7417470/da86d71eb424/42492_2020_55_Fig7_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a35f/7417470/c9fa773abf3c/42492_2020_55_Fig8_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a35f/7417470/b5229866a005/42492_2020_55_Fig9_HTML.jpg

Similar articles

1
Conversion of adverse data corpus to shrewd output using sampling metrics.
Vis Comput Ind Biomed Art. 2020 Aug 11;3(1):19. doi: 10.1186/s42492-020-00055-9.
2
Addressing Binary Classification over Class Imbalanced Clinical Datasets Using Computationally Intelligent Techniques.
Healthcare (Basel). 2022 Jul 13;10(7):1293. doi: 10.3390/healthcare10071293.
3
Structure-activity relationship-based chemical classification of highly imbalanced Tox21 datasets.
J Cheminform. 2020 Oct 27;12(1):66. doi: 10.1186/s13321-020-00468-x.
4
A comprehensive data level analysis for cancer diagnosis on imbalanced data.
J Biomed Inform. 2019 Feb;90:103089. doi: 10.1016/j.jbi.2018.12.003. Epub 2019 Jan 3.
5
Assessing and mitigating the effects of class imbalance in machine learning with application to X-ray imaging.
Int J Comput Assist Radiol Surg. 2020 Dec;15(12):2041-2048. doi: 10.1007/s11548-020-02260-6. Epub 2020 Sep 23.
6
Comparison of Resampling Techniques for Imbalanced Datasets in Machine Learning: Application to Epileptogenic Zone Localization From Interictal Intracranial EEG Recordings in Patients With Focal Epilepsy.
Front Neuroinform. 2021 Nov 19;15:715421. doi: 10.3389/fninf.2021.715421. eCollection 2021.
7
Effect of machine learning re-sampling techniques for imbalanced datasets in 18F-FDG PET-based radiomics model on prognostication performance in cohorts of head and neck cancer patients.
Eur J Nucl Med Mol Imaging. 2020 Nov;47(12):2826-2835. doi: 10.1007/s00259-020-04756-4. Epub 2020 Apr 6.
8
An empirical evaluation of sampling methods for the classification of imbalanced data.
PLoS One. 2022 Jul 28;17(7):e0271260. doi: 10.1371/journal.pone.0271260. eCollection 2022.
9
A Noise-Filtered Under-Sampling Scheme for Imbalanced Classification.
IEEE Trans Cybern. 2017 Dec;47(12):4263-4274. doi: 10.1109/TCYB.2016.2606104. Epub 2016 Oct 12.
10
Adaptive swarm cluster-based dynamic multi-objective synthetic minority oversampling technique algorithm for tackling binary imbalanced datasets in biomedical data classification.
BioData Min. 2016 Dec 1;9:37. doi: 10.1186/s13040-016-0117-1. eCollection 2016.

Cited by

1
A New Method of Deep Convolutional Neural Network Image Classification Based on Knowledge Transfer in Small Label Sample Environment.
Sensors (Basel). 2022 Jan 25;22(3):898. doi: 10.3390/s22030898.

References

1
A systematic study of the class imbalance problem in convolutional neural networks.
Neural Netw. 2018 Oct;106:249-259. doi: 10.1016/j.neunet.2018.07.011. Epub 2018 Jul 29.