Qian Guoqi, Rao Calyampudi Radhakrishna, Sun Xiaoying, Wu Yuehua
School of Mathematics and Statistics, The University of Melbourne, Parkville, VIC 3010, Australia;
Department of Biostatistics, University at Buffalo, The State University of New York, Buffalo, NY 14221-3000; CRRAO Advanced Institute of Mathematics, Statistics and Computer Science, Hyderabad-500046, India;
Proc Natl Acad Sci U S A. 2016 May 3;113(18):4958-63. doi: 10.1073/pnas.1604553113. Epub 2016 Apr 18.
Current algorithms for association rule mining from transaction data are mostly deterministic and enumerative. They can be computationally intractable even for mining a dataset containing just a few hundred transaction items, if no action is taken to constrain the search space. In this paper, we develop a Gibbs-sampling-induced stochastic search procedure to randomly sample association rules from the itemset space, and perform rule mining from the reduced transaction dataset generated by the sample. Also a general rule importance measure is proposed to direct the stochastic search so that, as a result of the randomly generated association rules constituting an ergodic Markov chain, the overall most important rules in the itemset space can be uncovered from the reduced dataset with probability 1 in the limit. In the simulation study and a real genomic data example, we show how to boost association rule mining by an integrated use of the stochastic search and the Apriori algorithm.
当前用于从交易数据中挖掘关联规则的算法大多是确定性的和枚举式的。如果不采取措施来限制搜索空间,即使对于挖掘一个只包含几百个交易项的数据集,它们在计算上也可能难以处理。在本文中,我们开发了一种由吉布斯采样引发的随机搜索过程,以便从项集空间中随机采样关联规则,并从由该采样生成的简化交易数据集中执行规则挖掘。此外,还提出了一种通用的规则重要性度量来指导随机搜索,使得由于随机生成的关联规则构成一个遍历马尔可夫链,在极限情况下可以从简化数据集中以概率1发现项集空间中总体最重要的规则。在模拟研究和一个真实基因组数据示例中,我们展示了如何通过随机搜索和Apriori算法的综合使用来促进关联规则挖掘。