Hu Ya-Han, Chen Yen-Liang
Department of Information Management, National Central University, Chung-Li 320, Taiwan, ROC.
Decis Support Syst. 2006 Oct;42(1):1-24. doi: 10.1016/j.dss.2004.09.007. Epub 2004 Nov 30.
Mining association rules with multiple minimum supports is an important generalization of the association-rule-mining problem, which was recently proposed by Liu et al. Instead of setting a single minimum support threshold for all items, they allow users to specify multiple minimum supports to reflect the natures of the items, and an Apriori-based algorithm, named MSapriori, is developed to mine all frequent itemsets. In this paper, we study the same problem but with two additional improvements. First, we propose a FP-tree-like structure, MIS-tree, to store the crucial information about frequent patterns. Accordingly, an efficient MIS-tree-based algorithm, called the CFP-growth algorithm, is developed for mining all frequent itemsets. Second, since each item can have its own minimum support, it is very difficult for users to set the appropriate thresholds for all items at a time. In practice, users need to tune items' supports and run the mining algorithm repeatedly until a satisfactory end is reached. To speed up this time-consuming tuning process, an efficient algorithm which can maintain the MIS-tree structure without rescanning database is proposed. Experiments on both synthetic and real-life datasets show that our algorithms are much more efficient and scalable than the previous algorithm.
挖掘具有多个最小支持度的关联规则是关联规则挖掘问题的一个重要推广,这是Liu等人最近提出的。他们不是为所有项设置单个最小支持度阈值,而是允许用户指定多个最小支持度以反映项的特性,并开发了一种基于Apriori的算法MSapriori来挖掘所有频繁项集。在本文中,我们研究相同的问题,但有两个额外的改进。首先,我们提出一种类似FP树的结构MIS树,用于存储关于频繁模式的关键信息。相应地,开发了一种高效的基于MIS树的算法CFP-growth算法来挖掘所有频繁项集。其次,由于每个项可以有自己的最小支持度,用户很难一次为所有项设置合适的阈值。在实践中,用户需要调整项的支持度并反复运行挖掘算法,直到达到满意的结果。为了加速这个耗时的调整过程,提出了一种高效算法,该算法可以在不重新扫描数据库的情况下维护MIS树结构。在合成数据集和真实数据集上的实验表明,我们的算法比以前的算法更高效且更具可扩展性。