Predicting defects in imbalanced data using resampling methods: an empirical investigation.

Authors

Malhotra Ruchika, Jain Juhi

Affiliations

Department of Software Engineering, Delhi Technological University (former Delhi College of Engineering), Shahbad Daulatpur, Delhi, India.

Department of Computer Science and Engineering, Delhi Technological University (former Delhi College of Engineering), Shahbad Daulatpur, Delhi, India.

Publication information

PeerJ Comput Sci. 2022 Apr 29;8:e573. doi: 10.7717/peerj-cs.573. eCollection 2022.

DOI:10.7717/peerj-cs.573

PMID:35634102

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9137963/

Abstract

The development of correct and effective software defect prediction (SDP) models is one of the most pressing needs of the software industry. Statistics from many defect-related open-source datasets reveal a class imbalance problem in object-oriented projects. Models trained on imbalanced data lead to inaccurate future predictions owing to biased learning and ineffective defect prediction. In addition, a large number of software metrics degrades model performance. This study aims at (1) identifying useful software metrics using correlation feature selection, (2) an extensive comparative analysis of 10 resampling methods to generate effective machine learning models for imbalanced data, (3) the inclusion of stable performance evaluators (AUC, GMean, and Balance), and (4) the integration of statistical validation of results. The impact of the 10 resampling methods is analyzed on selected features of 12 object-oriented Apache datasets using 15 machine learning techniques. The performance of the developed models is analyzed using AUC, GMean, Balance, and sensitivity. The statistical results advocate the use of resampling methods to improve SDP. Random oversampling yields the best predictive capability among the developed defect prediction models. The study provides a guideline for identifying metrics that are influential for SDP. Oversampling methods outperform undersampling methods.
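To make the terms in the abstract concrete, the following is a minimal sketch of random oversampling together with the GMean and Balance evaluators. It is an illustration, not the authors' implementation: the abstract gives no formulas, so the sketch assumes the definitions commonly used in the defect prediction literature, GMean = sqrt(pd * (1 - pf)) and Balance = 1 - sqrt((0 - pf)^2 + (1 - pd)^2) / sqrt(2), where pd is sensitivity and pf is the false positive rate. The function names `random_oversample` and `gmean_and_balance` are hypothetical.

```python
import math
import random


def random_oversample(rows, labels, minority=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes are equal in size.

    Assumption: binary labels, with `minority` marking the defective class.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    # Nothing to do if there is no minority sample or the data is already balanced.
    if not minority_idx or len(minority_idx) >= len(majority_idx):
        return list(rows), list(labels)
    shortfall = len(majority_idx) - len(minority_idx)
    extra = [rng.choice(minority_idx) for _ in range(shortfall)]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


def gmean_and_balance(y_true, y_pred, positive=1):
    """Return (GMean, Balance) under the assumed definitions stated above."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    pd = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity (probability of detection)
    pf = fp / (fp + tn) if (fp + tn) else 0.0  # false positive rate
    gmean = math.sqrt(pd * (1.0 - pf))
    balance = 1.0 - math.sqrt((0.0 - pf) ** 2 + (1.0 - pd) ** 2) / math.sqrt(2.0)
    return gmean, balance


if __name__ == "__main__":
    # Toy data: 2 defective modules out of 10 (values are made up for illustration).
    X = [[i] for i in range(10)]
    y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    X_res, y_res = random_oversample(X, y)
    print(len(y_res), sum(y_res))
    print(gmean_and_balance([1, 1, 0, 0], [1, 0, 0, 0]))
```

In practice, resampling of this kind is normally applied to the training split only, so that the evaluation data keeps its original class distribution.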

Similar articles

Comparison of Resampling Techniques for Imbalanced Datasets in Machine Learning: Application to Epileptogenic Zone Localization From Interictal Intracranial EEG Recordings in Patients With Focal Epilepsy.

Front Neuroinform. 2021 Nov 19;15:715421. doi: 10.3389/fninf.2021.715421. eCollection 2021.

KCO: Balancing class distribution in just-in-time software defect prediction using kernel crossover oversampling.

PLoS One. 2024 Apr 11;19(4):e0299585. doi: 10.1371/journal.pone.0299585. eCollection 2024.

Implications of resampling data to address the class imbalance problem (IRCIP): an evaluation of impact on performance between classification algorithms in medical data.

JAMIA Open. 2023 May 31;6(2):ooad033. doi: 10.1093/jamiaopen/ooad033. eCollection 2023 Jul.

Comparative Studies on Resampling Techniques in Machine Learning and Deep Learning Models for Drug-Target Interaction Prediction.

Molecules. 2023 Feb 9;28(4):1663. doi: 10.3390/molecules28041663.

Understanding random resampling techniques for class imbalance correction and their consequences on calibration and discrimination of clinical risk prediction models.

J Biomed Inform. 2024 Jul;155:104666. doi: 10.1016/j.jbi.2024.104666. Epub 2024 Jun 6.

Prediction of low Apgar score at five minutes following labor induction intervention in vaginal deliveries: machine learning approach for imbalanced data at a tertiary hospital in North Tanzania.

BMC Pregnancy Childbirth. 2022 Apr 1;22(1):275. doi: 10.1186/s12884-022-04534-0.

Effect of machine learning re-sampling techniques for imbalanced datasets in 18F-FDG PET-based radiomics model on prognostication performance in cohorts of head and neck cancer patients.

Eur J Nucl Med Mol Imaging. 2020 Nov;47(12):2826-2835. doi: 10.1007/s00259-020-04756-4. Epub 2020 Apr 6.

Learning from Imbalanced Data: Integration of Advanced Resampling Techniques and Machine Learning Models for Enhanced Cancer Diagnosis and Prognosis.

Cancers (Basel). 2024 Oct 8;16(19):3417. doi: 10.3390/cancers16193417.

Machine Learning Models for Predicting Influential Factors of Early Outcomes in Acute Ischemic Stroke: Registry-Based Study.

JMIR Med Inform. 2022 Mar 25;10(3):e32508. doi: 10.2196/32508.

Cited by

Feature selection based on neighborhood rough sets and Gini index.

PeerJ Comput Sci. 2023 Dec 12;9:e1711. doi: 10.7717/peerj-cs.1711. eCollection 2023.

References

Predicting disease risks from highly imbalanced data using random forest.

BMC Med Inform Decis Mak. 2011 Jul 29;11:51. doi: 10.1186/1472-6947-11-51.

Training neural network classifiers for medical decision making: the effects of imbalanced datasets on classification performance.

Neural Netw. 2008 Mar-Apr;21(2-3):427-36. doi: 10.1016/j.neunet.2007.12.031. Epub 2007 Dec 27.

Learning from imbalanced data in surveillance of nosocomial infection.

Artif Intell Med. 2006 May;37(1):7-18. doi: 10.1016/j.artmed.2005.03.002. Epub 2005 Oct 17.
