IEEE Trans Cybern. 2019 Feb;49(2):403-416. doi: 10.1109/TCYB.2017.2774266. Epub 2017 Dec 4.
Traditional ensemble learning approaches explore either the feature space or the sample space, which prevents them from constructing more powerful learning models for classifying noisy real-world datasets. The random subspace method searches only over selections of features, while the bagging approach searches only over selections of samples. To overcome these limitations, we propose the hybrid incremental ensemble learning (HIEL) approach, which considers the feature space and the sample space simultaneously to handle noisy datasets. Specifically, HIEL first adopts the bagging technique and linear discriminant analysis to remove noisy attributes, generating a set of bootstraps and the corresponding ensemble members in the subspaces. Then, classifiers are selected incrementally based on a classifier-specific criterion function and an ensemble criterion function, and the corresponding weights are assigned during the same process. Finally, the class label is determined by a weighted voting scheme, which serves as the final classification result. We also explore various classifier-specific criterion functions based on different newly proposed similarity measures, which alleviate the effect of noisy samples on the distance functions. In addition, the computational cost of HIEL is analyzed theoretically. A set of nonparametric tests is adopted to compare HIEL with other algorithms over several datasets. The experimental results show that HIEL performs well on noisy datasets, outperforming most of the compared classifier ensemble methods on 14 out of 24 noisy real-world UCI and KEEL datasets.
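To illustrate the pipeline described above, the following is a minimal sketch, not the authors' implementation: it assumes scikit-learn, replaces the paper's classifier-specific and ensemble criterion functions (and similarity measures) with plain validation accuracy, and uses the absolute LDA discriminant coefficients as a stand-in for the noisy-attribute removal step. All parameter values (number of members, feature fraction, tree depth) are arbitrary choices for the example.

```python
# HIEL-style sketch (illustrative assumptions only): bagging + LDA-based
# attribute filtering, greedy incremental classifier selection on a
# validation split, and weighted voting.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(X_tr, y_tr, test_size=0.3, random_state=0)

n_members, keep_frac = 15, 0.6  # arbitrary example values
members = []
for _ in range(n_members):
    # Bagging: draw a bootstrap sample of the training data.
    idx = rng.integers(0, len(X_fit), size=len(X_fit))
    Xb, yb = X_fit[idx], y_fit[idx]
    # LDA-based attribute filtering: keep the features with the largest
    # absolute discriminant coefficients (a stand-in for the paper's
    # noisy-attribute removal).
    lda = LinearDiscriminantAnalysis().fit(Xb, yb)
    scores = np.abs(lda.coef_).sum(axis=0)
    feats = np.argsort(scores)[::-1][: int(keep_frac * X.shape[1])]
    clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(Xb[:, feats], yb)
    members.append((clf, feats))

def vote(ensemble, wts, X_):
    # Weighted voting over the two classes of this example dataset.
    votes = np.zeros((len(X_), 2))
    for (clf, feats), w in zip(ensemble, wts):
        pred = clf.predict(X_[:, feats])
        for c in (0, 1):
            votes[:, c] += w * (pred == c)
    return votes.argmax(axis=1)

# Incremental selection: greedily add the member that most improves the
# ensemble criterion (here, weighted-vote accuracy on the validation split);
# each member's own validation accuracy serves as its voting weight.
selected, weights, remaining, best_score = [], [], list(members), 0.0
while remaining:
    cands = []
    for m in remaining:
        w = accuracy_score(y_val, m[0].predict(X_val[:, m[1]]))
        s = accuracy_score(y_val, vote(selected + [m], weights + [w], X_val))
        cands.append((s, m, w))
    score, m, w = max(cands, key=lambda t: t[0])
    if selected and score <= best_score:
        break  # no further improvement: stop adding members
    selected.append(m); weights.append(w); remaining.remove(m); best_score = score

print("members kept:", len(selected))
print("test accuracy:", accuracy_score(y_te, vote(selected, weights, X_te)))
```

The greedy loop mirrors the incremental flavor of HIEL: members are admitted one at a time only while the ensemble-level criterion keeps improving, and the weights fixed at admission time drive the final weighted vote.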