A scalable memetic algorithm for simultaneous instance and feature selection.

Author information

García-Pedrajas Nicolás, de Haro-García Aida, Pérez-Rodríguez Javier

Affiliation

Department of Computing and Numerical Analysis, University of Cordoba, Córdoba, 14014, Spain

Publication information

Evol Comput. 2014 Spring;22(1):1-45. doi: 10.1162/EVCO_a_00102. Epub 2013 Aug 8.

DOI:10.1162/EVCO_a_00102

PMID:23544367

Abstract

Instance selection is becoming increasingly relevant due to the huge amount of data that is constantly produced in many fields of research. At the same time, most of the recent pattern recognition problems involve highly complex datasets with a large number of possible explanatory variables. For many reasons, this abundance of variables significantly harms classification or recognition tasks. There are efficiency issues, too, because the speed of many classification algorithms is largely improved when the complexity of the data is reduced. One of the approaches to address problems that have too many features or instances is feature or instance selection, respectively. Although most methods address instance and feature selection separately, both problems are interwoven, and benefits are expected from facing these two tasks jointly. This paper proposes a new memetic algorithm for dealing with many instances and many features simultaneously by performing joint instance and feature selection. The proposed method performs four different local search procedures with the aim of obtaining the most relevant subsets of instances and features to perform an accurate classification. A new fitness function is also proposed that enforces instance selection but avoids putting too much pressure on removing features. We prove experimentally that this fitness function improves the results in terms of testing error. Regarding the scalability of the method, an extension of the stratification approach is developed for simultaneous instance and feature selection. This extension allows the application of the proposed algorithm to large datasets. An extensive comparison using 55 medium to large datasets from the UCI Machine Learning Repository shows the usefulness of our method. Additionally, the method is applied to 30 large problems, with very good results. The accuracy of the method for class-imbalanced problems in a set of 40 datasets is shown. The usefulness of the method is also tested using decision trees and support vector machines as classification methods.
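
The abstract does not give the fitness function or the four local search procedures, so the following Python sketch is only an illustration of the general idea, not the authors' method. It assumes a candidate solution is a pair of Boolean masks (one over instances, one over features), scores it with a 1-nearest-neighbor classifier, and rewards instance reduction but deliberately omits any feature-reduction term. The names `fitness`, `drop_instance_local_search`, and the weight `alpha` are hypothetical.

```python
import random


def one_nn_accuracy(train_X, train_y, eval_X, eval_y):
    """Accuracy of a 1-nearest-neighbor classifier (squared Euclidean distance)."""
    correct = 0
    for x, label in zip(eval_X, eval_y):
        nearest = min(
            range(len(train_X)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(x, train_X[i])),
        )
        correct += train_y[nearest] == label
    return correct / len(eval_X)


def fitness(instance_mask, feature_mask, X, y, alpha=0.75):
    """Hypothetical fitness: accuracy plus a reward for dropping instances.

    There is no term for dropping features, mirroring the abstract's remark
    that the fitness should not press hard on feature removal. The weight
    alpha is an assumption, not a value taken from the paper.
    """
    if not any(instance_mask) or not any(feature_mask):
        return 0.0  # an empty selection cannot classify anything
    cols = [j for j, keep in enumerate(feature_mask) if keep]
    rows = [i for i, keep in enumerate(instance_mask) if keep]
    projected = [[row[j] for j in cols] for row in X]
    # Selected instances act as the 1-NN reference set; all of X is evaluated.
    accuracy = one_nn_accuracy(
        [projected[i] for i in rows], [y[i] for i in rows], projected, y
    )
    instance_reduction = 1.0 - len(rows) / len(X)
    return alpha * accuracy + (1.0 - alpha) * instance_reduction


def drop_instance_local_search(instance_mask, feature_mask, X, y, rng, tries=20):
    """Illustrative stand-in for one local search: try removing random instances.

    A removal is kept only if the fitness does not decrease.
    """
    best = list(instance_mask)
    best_fit = fitness(best, feature_mask, X, y)
    for _ in range(tries):
        selected = [i for i, keep in enumerate(best) if keep]
        if len(selected) <= 1:
            break
        candidate = list(best)
        candidate[rng.choice(selected)] = False
        candidate_fit = fitness(candidate, feature_mask, X, y)
        if candidate_fit >= best_fit:
            best, best_fit = candidate, candidate_fit
    return best, best_fit
```

In a memetic algorithm, a refinement step of this kind would typically be applied to individuals produced by the evolutionary operators; how the paper interleaves its four procedures, and how it stratifies large datasets, is not recoverable from the abstract.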
