Bacanin Nebojsa, Venkatachalam K, Bezdan Timea, Zivkovic Miodrag, Abouhawwash Mohamed
Singidunum University, Danijelova 32, 11000 Belgrade, Serbia.
Department of Applied Cybernetics, Faculty of Science, University of Hradec Králové, 50003 Hradec Králové, Czech Republic.
Microprocess Microsyst. 2023 Apr;98:104778. doi: 10.1016/j.micpro.2023.104778. Epub 2023 Feb 6.
Feature selection is one of the most important challenges in machine learning and data science. The process is usually performed in the data preprocessing phase, where the data is transformed into a format suitable for further processing by machine learning algorithms. Many real-world datasets are highly dimensional, with many irrelevant and even redundant features. Such features do not improve classification accuracy and can even degrade the performance of a classifier. The goal of feature selection is to find an optimal (or sub-optimal) subset of features that contains the relevant information in the dataset, from which machine learning algorithms can derive useful conclusions. In this manuscript, a novel version of the firefly algorithm (FA) is proposed and adapted to the feature selection challenge. The proposed method significantly improves the performance of the basic FA and also outperforms other state-of-the-art metaheuristics on both benchmark bound-constrained and practical feature selection tasks. The method was first validated on standard unconstrained benchmarks and then applied to feature selection using 21 standard University of California, Irvine (UCI) datasets. Moreover, the presented approach was also tested on a relatively novel COVID-19 dataset for predicting patient health and on one microcontroller microarray dataset. The results obtained in all practical simulations attest to the robustness and efficiency of the proposed algorithm in terms of convergence, solution quality, and classification accuracy. More precisely, the proposed approach obtained the best classification accuracy on 13 of the 21 datasets, significantly outperforming the other competitor methods.
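To illustrate the general idea of wrapper-based feature selection with a firefly-type metaheuristic, the minimal sketch below uses a binary FA with a sigmoid transfer function, a k-nearest-neighbours classifier as the wrapper, and the scikit-learn breast cancer dataset. All of these choices (the transfer function, the fitness weighting, the classifier, the dataset, and the parameter values) are assumptions made for illustration only; this is not the improved FA variant proposed in the manuscript.

```python
# Hedged sketch: binary firefly algorithm (BFA) for wrapper feature selection.
# NOT the authors' exact variant; only an illustration of the general approach.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)

def fitness(mask, X, y, w=0.9):
    """Weighted sum of classification error and selected-feature ratio (lower is better)."""
    if mask.sum() == 0:                       # penalize empty feature subsets
        return 1.0
    clf = KNeighborsClassifier(n_neighbors=5)
    acc = cross_val_score(clf, X[:, mask == 1], y, cv=3).mean()
    return w * (1.0 - acc) + (1.0 - w) * mask.sum() / mask.size

def binary_firefly_fs(X, y, n_fireflies=10, iters=20,
                      beta0=1.0, gamma=1.0, alpha=0.2):
    d = X.shape[1]
    pos = rng.uniform(-1, 1, size=(n_fireflies, d))          # continuous positions
    masks = (1 / (1 + np.exp(-pos)) > rng.random((n_fireflies, d))).astype(int)
    fit = np.array([fitness(m, X, y) for m in masks])

    for _ in range(iters):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if fit[j] < fit[i]:           # firefly j is "brighter" (lower cost)
                    r2 = np.sum((pos[i] - pos[j]) ** 2)
                    beta = beta0 * np.exp(-gamma * r2)        # attractiveness decays with distance
                    pos[i] += beta * (pos[j] - pos[i]) + alpha * (rng.random(d) - 0.5)
            # sigmoid transfer function maps the continuous position to a binary mask
            masks[i] = (1 / (1 + np.exp(-pos[i])) > rng.random(d)).astype(int)
            fit[i] = fitness(masks[i], X, y)

    best = fit.argmin()
    return masks[best], fit[best]

if __name__ == "__main__":
    X, y = load_breast_cancer(return_X_y=True)
    mask, best_fit = binary_firefly_fs(X, y)
    print(f"selected {mask.sum()}/{mask.size} features, fitness = {best_fit:.4f}")
```

The fitness function trades off classification error against subset size, a common formulation in the feature selection literature; the weight `w` and the population/iteration settings are arbitrary here and would need tuning in practice.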