Jeyasingh Suganthi, Veluchamy Malathi
Department of Computer Science and Engineering, Raja College of Engineering and Technology, Madurai, Tamilnadu, India.
Asian Pac J Cancer Prev. 2017 May 1;18(5):1257-1264. doi: 10.22034/APJCP.2017.18.5.1257.
Early diagnosis of breast cancer is essential to saving patients' lives. Medical datasets usually contain a wide variety of data, which can cause confusion during diagnosis. The Knowledge Discovery in Databases (KDD) process helps to improve efficiency, but it requires inappropriate and duplicated data to be eliminated from the dataset before the final diagnosis. This can be done with any of the feature selection algorithms available in data mining, and feature selection is considered a vital step for increasing classification accuracy. This paper proposes a Modified Bat Algorithm (MBA) for feature selection that eliminates irrelevant features from the original dataset. The Bat algorithm was modified to use simple random sampling to select random instances from the dataset. Ranking against the global best features was then used to identify the predominant features in the dataset. The selected features were used to train a Random Forest (RF) classifier, and MBA feature selection improved the accuracy of RF in identifying the occurrence of breast cancer. The Wisconsin Diagnostic Breast Cancer (WDBC) dataset was used to evaluate the performance of the proposed MBA feature selection algorithm. The proposed algorithm achieved better performance in terms of the Kappa statistic, Matthews Correlation Coefficient, Precision, F-measure, Recall, Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Relative Absolute Error (RAE) and Root Relative Squared Error (RRSE).
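The evaluation measures named in the abstract have standard textbook definitions, and a short sketch may help readers see how they relate. The code below is an illustration under stated assumptions, not the authors' evaluation code: the function names `classification_metrics` and `error_metrics` are hypothetical, labels are assumed to be binary (1 = malignant, 0 = benign), and RAE and RRSE are assumed to be measured against a baseline that always predicts the mean of the true labels.

```python
import math


def classification_metrics(y_true, y_pred):
    """Confusion-matrix measures for binary labels (assumed: 1 = malignant, 0 = benign)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = tp + tn + fp + fn

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = (tp + tn) / n
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (p_observed - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0

    # Matthews Correlation Coefficient; defined as 0 when a marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "kappa": kappa,
        "mcc": mcc,
    }


def error_metrics(y_true, y_score):
    """MAE, RMSE, RAE and RRSE for numeric predictions such as class probabilities.

    Assumption: the relative errors use a baseline that predicts mean(y_true).
    """
    n = len(y_true)
    mean_true = sum(y_true) / n
    abs_err = sum(abs(s - t) for t, s in zip(y_true, y_score))
    sq_err = sum((s - t) ** 2 for t, s in zip(y_true, y_score))
    base_abs = sum(abs(t - mean_true) for t in y_true)
    base_sq = sum((t - mean_true) ** 2 for t in y_true)

    return {
        "mae": abs_err / n,
        "rmse": math.sqrt(sq_err / n),
        # Relative errors are undefined when all true labels are identical.
        "rae": abs_err / base_abs if base_abs else float("nan"),
        "rrse": math.sqrt(sq_err / base_sq) if base_sq else float("nan"),
    }
```

For Kappa, Matthews Correlation Coefficient, Precision, Recall and F-measure, higher values indicate better agreement with the true diagnosis. For MAE, RMSE, RAE and RRSE, lower values are better, and an RAE or RRSE of 1 means the predictions are no better than the mean baseline.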