Otey Matthew Eric, Parthasarathy Srinivasan, Wang Chao, Veloso Adriano, Meira Wagner
Computer and Information Science Department, The Ohio State University, Columbus, OH 43210, USA.
IEEE Trans Syst Man Cybern B Cybern. 2004 Dec;34(6):2439-50. doi: 10.1109/tsmcb.2004.836887.
Traditional methods for data mining typically make the assumption that the data is centralized, memory-resident, and static. This assumption is no longer tenable. Such methods waste computational and input/output (I/O) resources when data is dynamic, and they impose excessive communication overhead when data is distributed. Efficient implementation of incremental data mining methods is, thus, becoming crucial for ensuring system scalability and facilitating knowledge discovery when data is dynamic and distributed. In this paper, we address this issue in the context of the important task of frequent itemset mining. We first present an efficient algorithm which dynamically maintains the required information even in the presence of data updates without examining the entire dataset. We then show how to parallelize this incremental algorithm. We also propose a distributed asynchronous algorithm, which imposes minimal communication overhead for mining distributed dynamic datasets. Our distributed approach is capable of generating local models (in which each site has a summary of its own database) as well as the global model of frequent itemsets (in which all sites have a summary of the entire database). This ability permits our approach not only to generate frequent itemsets, but also to generate high-contrast frequent itemsets, which allows one to examine how the data is skewed over different sites.
传统的数据挖掘方法通常假定数据是集中式的、驻留在内存中的且是静态的。这种假设已不再成立。当数据是动态的时,此类方法会浪费计算和输入/输出(I/O)资源,而当数据是分布式时,它们会带来过多的通信开销。因此,高效实现增量数据挖掘方法对于确保系统可扩展性以及在数据动态且分布式的情况下促进知识发现变得至关重要。在本文中,我们在频繁项集挖掘这一重要任务的背景下解决此问题。我们首先提出一种高效算法,即使在存在数据更新的情况下,该算法也能动态维护所需信息,而无需检查整个数据集。然后我们展示如何将此增量算法并行化。我们还提出了一种分布式异步算法,该算法在挖掘分布式动态数据集时带来的通信开销最小。我们的分布式方法能够生成局部模型(其中每个站点都有其自身数据库的摘要)以及频繁项集的全局模型(其中所有站点都有整个数据库的摘要)。这种能力使我们的方法不仅能够生成频繁项集,还能生成高对比度频繁项集,从而使人们能够研究数据在不同站点上的倾斜情况。