Iqbal Saood, Shahid Abdul, Roman Muhammad, Khan Zahid, Al-Otaibi Shaha, Yu Lisu
Institute of Computing, Kohat University of Science & Technology, Kohat, KPK, Pakistan.
Robotics and Internet of Things Lab, Prince Sultan University, Riyadh, Saudi Arabia.
PeerJ Comput Sci. 2021 Mar 8;7:e385. doi: 10.7717/peerj-cs.385. eCollection 2021.
Frequent itemset mining is a significant topic in data mining research. Over the last ten years, the quantity of data has grown exponentially, which imposes new challenges on frequent itemset (FI) mining applications. Recent algorithms, both threshold-based and size-based, can yield misleading information. The support threshold plays a central role in generating frequent itemsets from a given dataset, yet selecting a suitable value is difficult for users who are unaware of the dataset's characteristics. Algorithms that find FIs without a support threshold, however, perform poorly because of heavy computation. We therefore propose a method for discovering FIs without a support threshold, called Top-k frequent itemsets mining (TKFIM). It uses equivalence classes and set-theory concepts to mine FIs. The proposed procedure does not miss any FIs, so the mined frequent patterns are accurate. The results are compared with state-of-the-art techniques, namely Top-k miner and Build Once and Mine Once (BOMO). TKFIM outperforms both approaches in terms of execution and performance, achieving gains of 92.70, 35.87, 28.53, and 81.27 percent over Top-k miner on the Chess, Mushroom, Connect, and T1014D100K datasets, respectively. Similarly, it achieves gains of 97.14, 100, 78.10, and 99.70 percent over BOMO on the Chess, Mushroom, Connect, and T1014D100K datasets, respectively. We therefore argue that the proposed procedure may be adopted on large datasets for better performance.
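To make the idea of mining without a user-supplied support threshold concrete, the following is a minimal Python sketch of a generic top-k search. It is not the TKFIM algorithm itself, whose details are not given in this abstract. It assumes a vertical layout (each item mapped to the set of transaction ids containing it), prefix-based equivalence classes extended by set intersection, and a min-heap of size k whose smallest support acts as an internal threshold that rises during the search. It also assumes that "top-k" means the k itemsets with the highest support, with ties at the boundary resolved arbitrarily. The names `top_k_itemsets`, `offer`, and `extend` are hypothetical.

```python
import heapq
from collections import defaultdict


def top_k_itemsets(transactions, k):
    """Return up to k (itemset, support) pairs with the highest support.

    Illustrative sketch only: a generic vertical (tidset) search,
    not the published TKFIM procedure.
    """
    # Vertical layout: item -> set of ids of transactions containing it.
    tidsets = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets[item].add(tid)

    heap = []  # min-heap of (support, itemset); holds the best k seen so far

    def threshold():
        # Internal threshold: smallest support in a full heap, else 1.
        return heap[0][0] if len(heap) == k else 1

    def offer(itemset, support):
        if len(heap) < k:
            heapq.heappush(heap, (support, itemset))
        elif support > heap[0][0]:
            heapq.heapreplace(heap, (support, itemset))

    def extend(prefix, candidates):
        # candidates share the same prefix, i.e. form one equivalence class.
        for i, (item, tids) in enumerate(candidates):
            if len(tids) < threshold():
                continue  # supersets cannot have higher support either
            itemset = prefix + (item,)
            offer(itemset, len(tids))
            next_class = []
            for other, other_tids in candidates[i + 1:]:
                common = tids & other_tids  # support via set intersection
                if common and len(common) >= threshold():
                    next_class.append((other, common))
            if next_class:
                extend(itemset, next_class)

    # Visit high-support items first so the threshold rises early.
    start = sorted(tidsets.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    extend((), start)
    ranked = sorted(heap, key=lambda e: (-e[0], e[1]))
    return [(itemset, support) for support, itemset in ranked]
```

Calling `top_k_itemsets(transactions, 3)` on a list of item lists returns the three highest-support itemsets with their supports. The caller chooses only k; the threshold used for pruning is derived from the heap rather than supplied in advance.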