Sahebi Golnaz, Movahedi Parisa, Ebrahimi Masoumeh, Pahikkala Tapio, Plosila Juha, Tenhunen Hannu
Department of Future Technologies, University of Turku, Turku, FI-20014, Turun yliopisto, Finland.
Comput Biol Med. 2020 Oct;125:103974. doi: 10.1016/j.compbiomed.2020.103974. Epub 2020 Aug 20.
In this paper, we propose a generalized wrapper-based feature selection method, called GeFeS, which is based on a new parallel intelligent genetic algorithm (GA). GeFeS works across numerical datasets of different dimensions and sizes, takes care to avoid overfitting, and significantly enhances classification accuracy. To make the GA more accurate, robust, and intelligent, we propose a new operator for feature weighting, improve the mutation and crossover operators, and integrate nested cross-validation into the GA process to properly validate the learning model. The k-nearest neighbor (kNN) classifier is used to evaluate the quality of the selected features. We evaluated GeFeS on various datasets selected from the UCI machine learning repository and compared its performance with state-of-the-art classification and feature selection methods. The results demonstrate that the proposed multi-population intelligent genetic algorithm generalizes well across two-class and multi-class datasets of different sizes. GeFeS achieved average classification accuracies of 95.83%, 97.62%, 99.02%, 98.51%, and 94.28% while reducing the number of features from 56 to 28, 34 to 18, 279 to 135, 30 to 16, and 19 to 9 on the lung cancer, dermatology, arrhythmia, WDBC, and hepatitis datasets, respectively.
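The wrapper idea described above (a GA searches over feature subsets, and a kNN classifier scores each subset by cross-validation) can be illustrated with a minimal Python sketch. This is not GeFeS itself: it assumes a single population, plain k-fold rather than nested cross-validation, standard one-point crossover and bit-flip mutation, and it omits the feature-weighting operator. All function names, parameter values, and the toy data are hypothetical and chosen only for illustration.

```python
import random


def knn_predict(train_X, train_y, x, mask, k=3):
    """Majority vote among the k nearest training rows, using only selected features."""
    cols = [j for j, keep in enumerate(mask) if keep]
    dists = sorted(
        (sum((row[j] - x[j]) ** 2 for j in cols), label)
        for row, label in zip(train_X, train_y)
    )
    votes = [label for _, label in dists[:k]]
    return max(set(votes), key=votes.count)


def cv_accuracy(X, y, mask, folds=5, k=3):
    """Fraction of samples classified correctly under k-fold cross-validation."""
    if not any(mask):
        return 0.0  # an empty feature subset cannot classify anything
    n = len(X)
    correct = 0
    for f in range(folds):
        test_idx = range(f, n, folds)  # every `folds`-th sample forms one fold
        held_out = set(test_idx)
        train_X = [X[i] for i in range(n) if i not in held_out]
        train_y = [y[i] for i in range(n) if i not in held_out]
        correct += sum(
            knn_predict(train_X, train_y, X[i], mask, k) == y[i] for i in test_idx
        )
    return correct / n


def select_features(X, y, pop_size=12, generations=15, mutation_rate=0.1, seed=0):
    """Generic GA wrapper (an assumption, not the GeFeS operators)."""
    rng = random.Random(seed)
    n_feat = len(X[0])
    cache = {}

    def fitness(mask):
        key = tuple(mask)
        if key not in cache:
            cache[key] = cv_accuracy(X, y, mask)
        return cache[key]

    # Each individual is a boolean mask: True means "keep this feature".
    pop = [[rng.random() < 0.5 for _ in range(n_feat)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]  # truncation selection keeps the best half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_feat)  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [(not g) if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        pop = parents + children
    best = max(pop, key=fitness)
    return best, fitness(best)


def make_toy_data(n=60, n_noise=4, seed=1):
    """Synthetic two-class data: 2 informative features followed by noise features."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        informative = [rng.gauss(3.0 * label, 1.0) for _ in range(2)]
        noise = [rng.gauss(0.0, 3.0) for _ in range(n_noise)]
        X.append(informative + noise)
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    mask, acc = select_features(X, y)
    print("selected features:", [j for j, keep in enumerate(mask) if keep])
    print("cross-validated accuracy:", acc)
```

Because the fitness is measured on the same folds that guide the search, the reported accuracy in this sketch is optimistic. Nested cross-validation, as the abstract describes, would keep an outer held-out split that the GA never sees.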