
Quick mining in dense data: applying probabilistic support prediction in depth-first order.

Author information

Sadeequllah Muhammad, Rauf Azhar, Rehman Saif Ur, Alnazzawi Noha

Affiliations

Department of Computer Science, University of Peshawar, Peshawar, KP, Pakistan.

Computer Science and Engineering Department, Yanbu Industrial College, Yanbu, Saudi Arabia.

Publication information

PeerJ Comput Sci. 2024 Oct 4;10:e2334. doi: 10.7717/peerj-cs.2334. eCollection 2024.

Abstract

Frequent itemset mining (FIM) is a major component of association rule mining and significantly influences its performance. FIM is a computationally intensive, NP-hard problem. At its core is the task of computing the support of candidate itemsets. This problem becomes more severe when the dataset is dense, as support must be computed for millions, or even billions, of candidate itemsets. The rapid growth of data further exacerbates the problem. To achieve high scalability and efficiency, researchers have recently proposed various approaches that approximate the support of an itemset using as small a subset of the transaction data as possible. In addition to efficiency, accuracy is another important metric for these algorithms: they strive to increase true positives while reducing false negatives and false positives. One such recently proposed approximate FIM algorithm is Probabilistic Breadth-First (ProbBF), which is highly efficient on dense data due to its unique approach of not using transaction data beyond itemsets of size two. Unlike its counterparts, it requires no input parameters beyond the traditional support threshold. However, ProbBF is a breadth-first algorithm, and it is well established that breadth-first FIM algorithms consume significantly more memory than depth-first algorithms on dense datasets. It is also worth noting that very high memory consumption slows an algorithm's run-time performance through poor locality of reference, thrashing, and aggressive garbage collection. This article proposes an FIM algorithm, ProbDF, that discards transaction data after determining all frequent itemsets of sizes one and two. For frequent itemsets of size three or more, it employs a probabilistic support prediction model (PSPM), first proposed with ProbBF, which predicts support using lightweight calculations that exclude transaction data.
Our experiments demonstrate that ProbDF, with its depth-first search strategy tailored to PSPM and other optimizations, is efficient in terms of time and space, and successfully generates the majority of frequent itemsets on real-world benchmark datasets. However, due to the probabilistic nature of ProbDF, some compromise in quality is inevitable.
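The abstract's core idea, counting exact supports only for itemsets of sizes one and two, discarding the transaction data, and then predicting the support of larger itemsets during a depth-first enumeration, can be illustrated with a minimal sketch. The predictor below (scaling the rarest item's support by the average pairwise co-occurrence ratio) is a hypothetical stand-in: the abstract does not give the actual PSPM formula, and `mine_probdf_sketch` is an illustrative name, not the authors' code.

```python
from itertools import combinations

def mine_probdf_sketch(transactions, minsup):
    """Depth-first FIM sketch in the spirit of ProbDF (hypothetical PSPM)."""
    # One pass over the data: exact supports for singletons and pairs.
    sup1, sup2 = {}, {}
    for t in transactions:
        items = sorted(set(t))
        for i in items:
            sup1[i] = sup1.get(i, 0) + 1
        for pair in combinations(items, 2):
            sup2[pair] = sup2.get(pair, 0) + 1
    # From here on, the transaction data is no longer consulted.

    def predicted_support(itemset):
        # Hypothetical predictor, NOT the paper's PSPM: take the rarest
        # item's support and damp it by the mean pairwise co-occurrence ratio.
        pairs = list(combinations(sorted(itemset), 2))
        ratios = [sup2.get(p, 0) / max(min(sup1[p[0]], sup1[p[1]]), 1)
                  for p in pairs]
        return min(sup1[i] for i in itemset) * (sum(ratios) / len(ratios))

    frequent = []

    def dfs(prefix, extensions):
        # Depth-first: extend the current prefix one item at a time.
        for idx, item in enumerate(extensions):
            cand = prefix + [item]
            if len(cand) == 1:
                sup = sup1[item]                      # exact
            elif len(cand) == 2:
                sup = sup2.get(tuple(cand), 0)        # exact
            else:
                sup = predicted_support(cand)         # predicted
            if sup >= minsup:
                frequent.append((tuple(cand), sup))
                dfs(cand, extensions[idx + 1:])
    dfs([], sorted(i for i, s in sup1.items() if s >= minsup))
    return frequent
```

Because prediction replaces counting for sizes three and up, the deep recursion never touches the dataset again, which is what makes the depth-first order cheap in memory relative to a breadth-first candidate-generation scheme.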


Graphical abstract: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/93f2/11623086/b477b9bf93ca/peerj-cs-10-2334-g001.jpg
