IEEE Trans Cybern. 2023 Jul;53(7):4579-4593. doi: 10.1109/TCYB.2021.3128540. Epub 2023 Jun 15.
Feature selection, an essential step in many real-world problems, aims to reduce the number of features and improve classification accuracy. Feature subsets that select different features can achieve similar or even identical objective values (e.g., classification accuracy to be maximized and number of selected features to be minimized), which means the optimal feature subset of a classification problem may not be unique. However, most existing feature selection methods do not consider finding multiple optimal feature subsets. In this article, a multiobjective differential evolution approach is developed to search for multiple optimal feature subsets. The contributions are threefold. First, to provide a good starting point, an initialization method that considers feature relevance is proposed. Second, a clustering method is used to divide the whole population into multiple subpopulations. In each subpopulation, a subarchive uses a specially developed crowding distance to ensure diversity in both the search space and the objective space. Finally, the nondominated solutions from all the subarchives are retained in another archive, which, together with an improved hypervolume contribution indicator, guides the evolutionary feature selection process. Experiments on 14 datasets of varying difficulty show that the proposed approach evolves a better Pareto front of feature subsets than seven other state-of-the-art methods and also finds different feature subsets with similar or the same classification performance.
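The idea that several different feature subsets can share the same objective values can be illustrated with a minimal sketch. It is not the article's algorithm: it shows only a plain Pareto dominance filter over two minimized objectives (classification error, number of selected features), and the names `dominates`, `nondominated`, and `candidates`, as well as all numeric values, are hypothetical and chosen for illustration.

```python
from typing import FrozenSet, List, Tuple

# (classification error, number of selected features); both are minimized.
Objectives = Tuple[float, int]
Candidate = Tuple[FrozenSet[int], Objectives]


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def nondominated(candidates: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates.

    Candidates with identical objective values do not dominate each other,
    so distinct feature subsets with equal objectives are all retained.
    """
    return [
        (subset, f)
        for subset, f in candidates
        if not any(dominates(g, f) for _, g in candidates)
    ]


# Made-up candidates: feature indices and objective values are illustrative only.
candidates: List[Candidate] = [
    (frozenset({0, 3}), (0.10, 2)),
    (frozenset({1, 4}), (0.10, 2)),        # different features, same objectives
    (frozenset({0, 1, 2}), (0.08, 3)),
    (frozenset({2}), (0.20, 1)),
    (frozenset({0, 1, 3, 4}), (0.10, 4)),  # dominated: same error, more features
    (frozenset({2, 4}), (0.15, 2)),        # dominated: higher error, same size
]

front = nondominated(candidates)
for subset, (error, size) in front:
    print(sorted(subset), error, size)
```

In this sketch the subsets `{0, 3}` and `{1, 4}` both survive the filter because they are equal in objective space while differing in the search space. Keeping such solutions apart is the kind of situation the subarchives and search-space-aware crowding distance described above are meant to handle.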