Malhotra Ruchika, Jain Juhi
Department of Software Engineering, Delhi Technological University (former Delhi College of Engineering), Shahbad Daulatpur, Delhi, India.
Department of Computer Science and Engineering, Delhi Technological University (former Delhi College of Engineering), Shahbad Daulatpur, Delhi, India.
PeerJ Comput Sci. 2022 Apr 29;8:e573. doi: 10.7717/peerj-cs.573. eCollection 2022.
Developing correct and effective software defect prediction (SDP) models is one of the most pressing needs of the software industry. Statistics from many defect-related open-source datasets reveal a class imbalance problem in object-oriented projects. Models trained on imbalanced data learn in a biased way and therefore produce inaccurate defect predictions. In addition, a large number of software metrics degrades model performance. This study aims at (1) identifying useful software metrics using correlation feature selection, (2) an extensive comparative analysis of 10 resampling methods for building effective machine learning models on imbalanced data, (3) the use of stable performance evaluators (AUC, GMean, and Balance), and (4) statistical validation of the results. The impact of the 10 resampling methods is analyzed on the selected features of 12 object-oriented Apache datasets using 15 machine learning techniques. The performance of the developed models is assessed using AUC, GMean, Balance, and sensitivity. The statistical results support the use of resampling methods to improve SDP. Models built with random oversampling show the best predictive capability. The study provides a guideline for identifying the metrics that are influential for SDP. Overall, oversampling methods outperform undersampling methods.
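As an illustration of the two ideas named above, the following minimal Python sketch shows random oversampling and the computation of sensitivity, GMean, and Balance from binary predictions. It is not the authors' implementation. The function names (`random_oversample`, `sensitivity_gmean_balance`) are hypothetical, and the metric formulas are the definitions commonly used in the SDP literature, assumed here rather than taken from the abstract: GMean is the square root of sensitivity times specificity, and Balance is one minus the normalized Euclidean distance from the ideal point (probability of false alarm 0, sensitivity 1).

```python
import math
import random


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many samples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    target = max(counts.values())

    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))  # sample with replacement
            out_labels.append(cls)
    return out_rows, out_labels


def sensitivity_gmean_balance(y_true, y_pred, positive=1):
    """Return (sensitivity, GMean, Balance) for binary labels.
    Formulas are the commonly used definitions, assumed for this sketch."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # probability of detection
    pf = fp / (fp + tn) if (fp + tn) else 0.0            # probability of false alarm
    specificity = 1.0 - pf

    gmean = math.sqrt(sensitivity * specificity)
    balance = 1.0 - math.sqrt(pf ** 2 + (1.0 - sensitivity) ** 2) / math.sqrt(2)
    return sensitivity, gmean, balance
```

In a sketch like this, resampling would be applied only to the training portion of the data, so that the evaluation metrics are computed on data with the original class distribution.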