
Quick mining in dense data: applying probabilistic support prediction in depth-first order.

Author information

Sadeequllah Muhammad, Rauf Azhar, Rehman Saif Ur, Alnazzawi Noha

Affiliations

Department of Computer Science, University of Peshawar, Peshawar, KP, Pakistan.

Computer Science and Engineering Department, Yanbu Industrial College, Yanbu, Saudi Arabia.

Publication information

PeerJ Comput Sci. 2024 Oct 4;10:e2334. doi: 10.7717/peerj-cs.2334. eCollection 2024.

Abstract

Frequent itemset mining (FIM) is a major component of association rule mining and significantly influences its performance. FIM is a computationally intensive, NP-hard problem. At its core is the task of computing the support of candidate itemsets. This problem becomes more severe when the dataset is dense, as support must be computed for millions, or even billions, of candidate itemsets. The rapid growth of data further exacerbates the problem. To achieve high scalability and efficiency, researchers have recently proposed various approaches that approximate the support of an itemset using as small a subset of the transaction data as possible. In addition to efficiency, accuracy is another important metric for these algorithms: they strive to increase true positives while reducing false negatives and false positives. One such recently proposed approximate FIM algorithm is Probabilistic Breadth-First (ProbBF), which is highly efficient on dense data due to its unique approach of not using transaction data beyond itemsets of size two. Unlike its counterparts, it requires no input parameters beyond the traditional support threshold. However, ProbBF is a breadth-first algorithm, and it is well established that breadth-first FIM algorithms consume significantly more memory than depth-first algorithms on dense datasets. It is also worth noting that very high memory consumption slows an algorithm's run-time performance through poor locality of reference, thrashing, and aggressive garbage collection. This article proposes an FIM algorithm, ProbDF, that discards transaction data after determining all frequent itemsets of sizes one and two. For frequent itemsets of size three or more, it employs a probabilistic support prediction model (PSPM), first proposed with ProbBF, which predicts support using lightweight calculations that exclude transaction data.
Our experiments demonstrate that ProbDF, with its depth-first search strategy tailored to PSPM and other optimizations, is efficient in terms of time and space, and successfully generates the majority of frequent itemsets on real-world benchmark datasets. However, due to the probabilistic nature of ProbDF, some compromise in quality is inevitable.
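The abstract's core idea, counting exact supports only for itemsets of sizes one and two, discarding the transaction data, and then predicting the support of larger itemsets during a depth-first enumeration, can be illustrated with a minimal sketch. The predictor below (scaling the rarest item's support by the average pairwise co-occurrence ratio) is a hypothetical stand-in: the abstract does not give the actual PSPM formula, and `mine_probdf_sketch` is an illustrative name, not the authors' code.

```python
from itertools import combinations

def mine_probdf_sketch(transactions, minsup):
    """Depth-first FIM sketch in the spirit of ProbDF (hypothetical PSPM)."""
    # One pass over the data: exact supports for singletons and pairs.
    sup1, sup2 = {}, {}
    for t in transactions:
        items = sorted(set(t))
        for i in items:
            sup1[i] = sup1.get(i, 0) + 1
        for pair in combinations(items, 2):
            sup2[pair] = sup2.get(pair, 0) + 1
    # From here on, the transaction data is no longer consulted.

    def predicted_support(itemset):
        # Hypothetical predictor, NOT the paper's PSPM: take the rarest
        # item's support and damp it by the mean pairwise co-occurrence ratio.
        pairs = list(combinations(sorted(itemset), 2))
        ratios = [sup2.get(p, 0) / max(min(sup1[p[0]], sup1[p[1]]), 1)
                  for p in pairs]
        return min(sup1[i] for i in itemset) * (sum(ratios) / len(ratios))

    frequent = []

    def dfs(prefix, extensions):
        # Depth-first: extend the current prefix one item at a time.
        for idx, item in enumerate(extensions):
            cand = prefix + [item]
            if len(cand) == 1:
                sup = sup1[item]                      # exact
            elif len(cand) == 2:
                sup = sup2.get(tuple(cand), 0)        # exact
            else:
                sup = predicted_support(cand)         # predicted
            if sup >= minsup:
                frequent.append((tuple(cand), sup))
                dfs(cand, extensions[idx + 1:])
    dfs([], sorted(i for i, s in sup1.items() if s >= minsup))
    return frequent
```

Because prediction replaces counting for sizes three and up, the deep recursion never touches the dataset again, which is what makes the depth-first order cheap in memory relative to a breadth-first candidate-generation scheme.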


Graphical abstract: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/93f2/11623086/b477b9bf93ca/peerj-cs-10-2334-g001.jpg
