Grzymala-Busse Jerzy W, Mroczek Teresa
Department of Electrical Engineering and Computer Science, University of Kansas, Lawrence, KS 66045, USA.
Department of Expert Systems and Artificial Intelligence, University of Information Technology and Management, Rzeszow 35-225, Poland.
Entropy (Basel). 2018 Nov 16;20(11):880. doi: 10.3390/e20110880.
As previous research indicates, a multiple-scanning methodology for discretization of numerical datasets, based on entropy, is very competitive. Discretization converts the numerical values of data records into discrete values associated with intervals defined over the domains of those records. In multiple-scanning discretization, the last step is the merging of neighboring intervals in the discretized datasets, a kind of postprocessing. Our objective was to check how such merging affects the error rate, measured by tenfold cross-validation within the C4.5 system. We conducted experiments on 17 numerical datasets, using the same multiple-scanning setup, with three merging options: no merging at all, merging based on the smallest entropy, and merging based on the largest entropy. The Friedman rank sum test (5% significance level) showed that the differences among the three approaches are statistically insignificant; there is no universally best approach. We then repeated all experiments 30 times, recording averages and standard deviations. A test of the difference between averages showed that, when comparing no merging with merging based on the smallest entropy, there are statistically highly significant differences (at the 1% significance level): in some cases the smaller error rate is associated with no merging, and in others with merging based on the smallest entropy. A comparison of no merging with merging based on the largest entropy showed similar results. Our final conclusion is that there are highly significant differences between no merging and merging, depending on the dataset, so the best approach should be chosen by trying all three approaches.
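The entropy-based merging step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a discretized attribute is represented as a list of neighboring blocks, each block holding the class labels of the cases that fall into that interval, and it greedily merges the one pair of adjacent blocks whose merge yields the smallest (or, optionally, the largest) conditional entropy of the resulting partition.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(blocks):
    """Weighted average entropy of a partition, where each block is a
    list of the class labels of the cases in one interval."""
    total = sum(len(b) for b in blocks)
    return sum(len(b) / total * entropy(b) for b in blocks)

def merge_adjacent(blocks, smallest=True):
    """Merge the single pair of neighboring blocks whose merge gives the
    smallest (smallest=True) or largest (smallest=False) conditional
    entropy.  Returns the new partition and its conditional entropy.
    Note: the greedy one-pair-at-a-time strategy is an illustrative
    assumption, not taken from the paper."""
    best_i, best_h = None, None
    for i in range(len(blocks) - 1):
        candidate = blocks[:i] + [blocks[i] + blocks[i + 1]] + blocks[i + 2:]
        h = conditional_entropy(candidate)
        if best_h is None or (h < best_h if smallest else h > best_h):
            best_i, best_h = i, h
    merged = blocks[:best_i] + [blocks[best_i] + blocks[best_i + 1]] + blocks[best_i + 2:]
    return merged, best_h

# Three neighboring intervals with their class labels; merging the first
# two intervals ties with merging the last two, so the first is chosen.
blocks = [["a", "a"], ["a", "b"], ["b", "b"]]
merged, h = merge_adjacent(blocks, smallest=True)
print(len(merged), round(h, 4))  # 2 intervals remain
```

Repeating the call until a stopping condition (e.g. a target number of intervals) is met would give the full postprocessing pass; the smallest-entropy and largest-entropy variants compared in the paper differ only in the `smallest` flag.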