IEEE Trans Cybern. 2016 Mar;46(3):595-608. doi: 10.1109/TCYB.2015.2410143. Epub 2015 Mar 18.
Discretization is one of the most relevant techniques for data preprocessing. Its main goal is to transform numerical attributes into discrete ones, which helps experts understand the data more easily and makes it possible to use learning algorithms that require discrete input, such as Bayesian or rule learning. We focus on multivariate classification problems, where strong interactions among multiple attributes exist. In this paper, we propose using evolutionary algorithms to select a subset of cut points that defines the best possible discretization scheme for a data set, guided by a wrapper fitness function. We also incorporate a reduction mechanism so that the multivariate approach remains manageable on large data sets. Our method has been compared with the best state-of-the-art discretizers on 45 real data sets. The experiments show that the proposed algorithm outperforms the other methods, producing competitive discretization schemes in terms of accuracy for the C4.5, Naive Bayes, PART, and PrUning and BuiLding Integrated in Classification (PUBLIC) classifiers, while obtaining far simpler solutions.
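To make the idea of evolutionary cut-point selection with a wrapper fitness concrete, the following is a minimal Python sketch under stated assumptions, not the algorithm evaluated in the paper. The function names (`candidate_cuts`, `discretize`, `loo_error`, `fitness`, `evolve`), the weight `alpha`, the plain generational genetic algorithm, and the lookup-table classifier used as the wrapped learner are all hypothetical stand-ins; the abstract does not specify the evolutionary model, the fitness weighting, or the reduction mechanism, and the latter is omitted here. The sketch also assumes at least one candidate cut point exists.

```python
import random
from bisect import bisect_right
from collections import Counter, defaultdict


def candidate_cuts(X):
    """Per attribute: midpoints between consecutive distinct sorted values."""
    cuts = []
    for j in range(len(X[0])):
        vals = sorted({row[j] for row in X})
        cuts.append([(a + b) / 2 for a, b in zip(vals, vals[1:])])
    return cuts


def discretize(X, cuts, mask):
    """Replace each value by the index of its interval under the selected cuts."""
    selected, pos = [], 0
    for attr_cuts in cuts:
        bits = mask[pos:pos + len(attr_cuts)]
        selected.append([c for c, b in zip(attr_cuts, bits) if b])
        pos += len(attr_cuts)
    return [tuple(bisect_right(s, v) for s, v in zip(selected, row)) for row in X]


def loo_error(cells, y):
    """Leave-one-out error of a lookup-table classifier (assumed stand-in learner)."""
    table = defaultdict(Counter)
    for c, label in zip(cells, y):
        table[c][label] += 1
    default = Counter(y).most_common(1)[0][0]  # fallback: overall majority class
    wrong = 0
    for c, label in zip(cells, y):
        rest = table[c].copy()
        rest[label] -= 1  # hold out the current instance
        rest = +rest      # drop labels whose count fell to zero
        pred = rest.most_common(1)[0][0] if rest else default
        wrong += pred != label
    return wrong / len(y)


def fitness(mask, X, y, cuts, alpha=0.7):
    """Lower is better: weighted error plus fraction of cut points kept."""
    err = loo_error(discretize(X, cuts, mask), y)
    return alpha * err + (1 - alpha) * sum(mask) / len(mask)


def evolve(X, y, pop_size=20, generations=30, p_mut=0.05, seed=0):
    """Simple generational GA over binary masks of candidate cut points."""
    rng = random.Random(seed)
    cuts = candidate_cuts(X)
    n = sum(len(c) for c in cuts)
    score = lambda m: fitness(m, X, y, cuts)
    # Start from the full scheme plus random subsets of cut points.
    pop = [[1] * n] + [[rng.randint(0, 1) for _ in range(n)]
                       for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=score)
        nxt = pop[:2]  # elitism: carry the two best forward unchanged
        while len(nxt) < pop_size:
            a = min(rng.sample(pop, 3), key=score)  # tournament selection
            b = min(rng.sample(pop, 3), key=score)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            child = [bit ^ (rng.random() < p_mut) for bit in child]  # bit-flip mutation
            nxt.append(child)
        pop = nxt
    best = min(pop, key=score)
    return best, cuts, score(best)


# Toy data: the class depends only on whether the first attribute exceeds 5.
X = [[1.0, 7.0], [2.0, 3.0], [3.0, 9.0], [4.0, 1.0],
     [6.0, 8.0], [7.0, 2.0], [8.0, 6.0], [9.0, 4.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
best, cuts, score = evolve(X, y)
print(sum(best), "of", len(best), "cut points kept; fitness", round(score, 3))
```

Because one chromosome encodes the cut points of all attributes together, each candidate scheme is evaluated as a whole, which is what lets the search account for interactions among attributes. The second fitness term rewards smaller schemes, mirroring the abstract's emphasis on simpler solutions; the value of `alpha` is an arbitrary choice for illustration.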