Hewanadungodage Chandima, Xia Yuni, Lee Jaehwan John, Tu Yi-Cheng
Department of Computer and Information Science, Indiana University-Purdue University Indianapolis, 723 West Michigan Street, Indianapolis, IN 46202-5132, USA.
Department of Electrical and Computer Engineering, Indiana University-Purdue University Indianapolis, 723 West Michigan Street, Indianapolis, IN 46202-5132, USA.
Knowl Inf Syst. 2013 Oct 1;37(1):219-244. doi: 10.1007/s10115-012-0581-y.
Data uncertainty is inherent in many real-world applications such as sensor monitoring systems, location-based services, and medical diagnostic systems. Moreover, many real-world applications are now capable of producing continuous, unbounded data streams. During the recent years, new methods have been developed to find frequent patterns in uncertain databases; nevertheless, very limited work has been done in discovering frequent patterns in uncertain data streams. The current solutions for frequent pattern mining in uncertain streams take a FP-tree-based approach; however, recent studies have shown that FP-tree-based algorithms do not perform well in the presence of data uncertainty. In this paper, we propose two hyper-structure-based false-positive-oriented algorithms to efficiently mine frequent itemsets from streams of uncertain data. The first algorithm, UHS-Stream, is designed to find all frequent itemsets up to the current moment. The second algorithm, TFUHS-Stream, is designed to find frequent itemsets in an uncertain data stream in a time-fading manner. Experimental results show that the proposed hyper-structure-based algorithms outperform the existing tree-based algorithms in terms of accuracy, runtime, and memory usage.
数据不确定性在许多现实世界的应用中都存在,如传感器监测系统、基于位置的服务和医疗诊断系统。此外,许多现实世界的应用现在能够产生连续的、无界的数据流。近年来,已经开发出了新的方法来在不确定数据库中发现频繁模式;然而,在不确定数据流中发现频繁模式的工作却非常有限。当前用于不确定流中频繁模式挖掘的解决方案采用基于FP树的方法;然而,最近的研究表明,基于FP树的算法在存在数据不确定性的情况下表现不佳。在本文中,我们提出了两种基于超结构的面向误报的算法,以有效地从不确定数据流中挖掘频繁项集。第一种算法UHS-Stream旨在找到截至当前时刻的所有频繁项集。第二种算法TFUHS-Stream旨在以时间衰减的方式在不确定数据流中找到频繁项集。实验结果表明,所提出的基于超结构的算法在准确性、运行时间和内存使用方面优于现有的基于树的算法。