IEEE Trans Cybern. 2018 Sep;48(9):2656-2669. doi: 10.1109/TCYB.2017.2748225. Epub 2017 Sep 19.
Fuzzy associative classification has not been widely analyzed in the literature, although associative classifiers (ACs) have proved to be very effective in a range of real-world application domains. The main reason is that learning fuzzy ACs is computationally expensive, especially when dealing with large datasets. To overcome this drawback, in this paper we propose an efficient distributed fuzzy associative classification approach based on the MapReduce paradigm. The approach exploits a novel distributed discretizer based on fuzzy entropy for efficiently generating fuzzy partitions of the attributes. Then, a set of candidate fuzzy association rules is generated by employing a distributed fuzzy extension of the well-known FP-Growth algorithm. Finally, this set is pruned by using three purposely adapted types of pruning. We implemented our approach on the popular Hadoop framework, which allows distributed storage and processing of very large datasets on computer clusters built from commodity hardware. We performed extensive experiments and a detailed analysis of the results using six very large datasets with up to 11 000 000 instances. We also experimented with different types of reasoning methods. Focusing on accuracy, model complexity, computation time, and scalability, we compare the results achieved by our approach with those obtained by two distributed nonfuzzy ACs recently proposed in the literature. We highlight that, although the accuracies are comparable, the complexity of the classifiers generated by the fuzzy distributed approach, evaluated in terms of number of rules, is lower than that of the nonfuzzy classifiers.
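To make the notion of a fuzzy association rule concrete, the following is a minimal single-machine sketch of how the support and confidence of one such rule could be computed. It is not the distributed algorithm of the paper: the abstract gives no formulas, so the triangular membership functions, the product t-norm for the matching degree, and all names (`triangular`, `matching_degree`, `fuzzy_support_confidence`, the toy data) are assumptions chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, float]
Membership = Callable[[float], float]


def triangular(a: float, b: float, c: float) -> Membership:
    """Triangular membership function with feet at a and c and peak at b (a < b < c)."""
    def mu(x: float) -> float:
        if x <= a or x >= c:
            return 0.0
        if x <= b:
            return (x - a) / (b - a)
        return (c - x) / (c - b)
    return mu


def matching_degree(antecedent: List[Tuple[str, str]],
                    record: Record,
                    partitions: Dict[str, Dict[str, Membership]]) -> float:
    """Degree to which a record matches the antecedent (product t-norm, an assumption)."""
    degree = 1.0
    for attribute, fuzzy_set in antecedent:
        degree *= partitions[attribute][fuzzy_set](record[attribute])
    return degree


def fuzzy_support_confidence(antecedent: List[Tuple[str, str]],
                             target_class: str,
                             data: List[Tuple[Record, str]],
                             partitions: Dict[str, Dict[str, Membership]]
                             ) -> Tuple[float, float]:
    """Fuzzy support and confidence of the rule 'antecedent -> target_class'."""
    total = 0.0       # summed matching degree over all records
    with_class = 0.0  # summed matching degree over records of the target class
    for record, label in data:
        w = matching_degree(antecedent, record, partitions)
        total += w
        if label == target_class:
            with_class += w
    support = with_class / len(data)
    confidence = with_class / total if total > 0 else 0.0
    return support, confidence


# Hypothetical partition of one attribute "x" on [0, 1] into three fuzzy sets.
partitions = {
    "x": {
        "low": triangular(-0.5, 0.0, 0.5),
        "medium": triangular(0.0, 0.5, 1.0),
        "high": triangular(0.5, 1.0, 1.5),
    }
}

# Toy labeled data, invented for this example.
data = [({"x": 0.0}, "A"), ({"x": 0.25}, "A"), ({"x": 0.5}, "B"), ({"x": 1.0}, "B")]

low_rule = fuzzy_support_confidence([("x", "low")], "A", data, partitions)
medium_rule = fuzzy_support_confidence([("x", "medium")], "B", data, partitions)
```

Because a record can belong partially to several fuzzy sets, each record contributes a weight in [0, 1] rather than a 0/1 count; this is the main difference from the crisp support and confidence used by nonfuzzy associative classifiers.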