Suppr超能文献

用于增量频繁项集挖掘的并行和分布式方法。

Parallel and distributed methods for incremental frequent itemset mining.

作者信息

Otey Matthew Eric, Parthasarathy Srinivasan, Wang Chao, Veloso Adriano, Meira Wagner

机构信息

Computer and Information Science Department, The Ohio State University, Columbus, OH 43210, USA.

出版信息

IEEE Trans Syst Man Cybern B Cybern. 2004 Dec;34(6):2439-50. doi: 10.1109/tsmcb.2004.836887.

Abstract

Traditional methods for data mining typically make the assumption that the data is centralized, memory-resident, and static. This assumption is no longer tenable. Such methods waste computational and input/output (I/O) resources when data is dynamic, and they impose excessive communication overhead when data is distributed. Efficient implementation of incremental data mining methods is, thus, becoming crucial for ensuring system scalability and facilitating knowledge discovery when data is dynamic and distributed. In this paper, we address this issue in the context of the important task of frequent itemset mining. We first present an efficient algorithm which dynamically maintains the required information even in the presence of data updates without examining the entire dataset. We then show how to parallelize this incremental algorithm. We also propose a distributed asynchronous algorithm, which imposes minimal communication overhead for mining distributed dynamic datasets. Our distributed approach is capable of generating local models (in which each site has a summary of its own database) as well as the global model of frequent itemsets (in which all sites have a summary of the entire database). This ability permits our approach not only to generate frequent itemsets, but also to generate high-contrast frequent itemsets, which allows one to examine how the data is skewed over different sites.

摘要

传统的数据挖掘方法通常假定数据是集中式的、驻留在内存中的且是静态的。这种假设已不再成立。当数据是动态的时,此类方法会浪费计算和输入/输出(I/O)资源,而当数据是分布式时,它们会带来过多的通信开销。因此,高效实现增量数据挖掘方法对于确保系统可扩展性以及在数据动态且分布式的情况下促进知识发现变得至关重要。在本文中,我们在频繁项集挖掘这一重要任务的背景下解决此问题。我们首先提出一种高效算法,即使在存在数据更新的情况下,该算法也能动态维护所需信息,而无需检查整个数据集。然后我们展示如何将此增量算法并行化。我们还提出了一种分布式异步算法,该算法在挖掘分布式动态数据集时带来的通信开销最小。我们的分布式方法能够生成局部模型(其中每个站点都有其自身数据库的摘要)以及频繁项集的全局模型(其中所有站点都有整个数据库的摘要)。这种能力使我们的方法不仅能够生成频繁项集,还能生成高对比度频繁项集,从而使人们能够研究数据在不同站点上的倾斜情况。

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验