Ali Misbah, Mazhar Tehseen, Al-Rasheed Amal, Shahzad Tariq, Yasin Ghadi Yazeed, Amir Khan Muhammad
Department of Computer Science & Information Technology, Virtual University of Pakistan, Lahore, Pakistan.
Department of Information Systems, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
PeerJ Comput Sci. 2024 Feb 28;10:e1860. doi: 10.7717/peerj-cs.1860. eCollection 2024.
Effective software defect prediction is a crucial aspect of software quality assurance, enabling the identification of defective modules before the testing phase. This study proposes a comprehensive five-stage framework for software defect prediction that addresses current challenges in the field. The first stage selects a cleaned version of NASA's defect datasets (CM1, JM1, MC2, MW1, PC1, PC3, and PC4) to ensure data integrity. In the second stage, a feature selection technique based on the genetic algorithm identifies the optimal subset of features. In the third stage, three heterogeneous binary classifiers, namely random forest, support vector machine, and naïve Bayes, are implemented as base classifiers and iteratively tuned so that each achieves its highest individual accuracy. In the fourth stage, an ensemble machine-learning technique known as voting is applied as a master classifier, combining the decisions of the base classifiers. The final stage evaluates the framework using five widely recognized performance measures: precision, recall, accuracy, F-measure, and area under the curve. Experimental results demonstrate that the proposed framework outperforms state-of-the-art ensemble and base classifiers employed in software defect prediction and achieves a maximum accuracy of 95.1%. The framework's efficiency is also evaluated by measuring execution times: training and testing times are reduced by an average of 51.52% and 52.31%, respectively, which makes the framework a more computationally economical solution for accurate software defect prediction.
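The voting and evaluation stages described in the abstract can be illustrated with a minimal sketch. It assumes hard (majority) voting, which the abstract does not specify, and uses small hypothetical prediction lists in place of real classifier output on the NASA datasets. The function names `majority_vote` and `binary_metrics` are illustrative and not taken from the paper. Area under the curve is omitted because it requires class scores rather than hard labels.

```python
from __future__ import annotations

from collections import Counter
from typing import Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> list[int]:
    """Combine one label list per base classifier into one label per module.

    Assumption: hard voting, i.e. each classifier casts one vote per module.
    """
    combined = []
    for votes in zip(*predictions):
        # Pick the most frequent label; with three binary voters there is no tie.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> dict[str, float]:
    """Precision, recall, accuracy, and F-measure for the defective class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }


# Hypothetical labels for eight modules (1 = defective, 0 = clean).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
# Hypothetical predictions standing in for the three tuned base classifiers.
rf_pred = [1, 0, 1, 0, 0, 0, 1, 0]   # random forest
svm_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # support vector machine
nb_pred = [0, 0, 1, 1, 1, 0, 1, 0]   # naive Bayes

ensemble_pred = majority_vote([rf_pred, svm_pred, nb_pred])

for name, pred in [("RF", rf_pred), ("SVM", svm_pred),
                   ("NB", nb_pred), ("Voting", ensemble_pred)]:
    print(name, binary_metrics(y_true, pred))
```

In this toy data each base classifier errs on different modules, so the majority vote corrects every individual mistake. Real data will rarely behave this cleanly; the sketch only shows the mechanism by which a voting master classifier can exceed its base classifiers.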