IEEE Trans Cybern. 2020 Mar;50(3):1195-1208. doi: 10.1109/TCYB.2019.2896267. Epub 2019 Feb 20.
Mining useful patterns from various types of databases is an important research topic with many real-life applications. Most studies have considered frequency as the sole interestingness measure for identifying high-quality patterns. However, objects differ in nature, and their relative importance is not equal in terms of criteria such as utility, risk, or interest. Another limitation of frequent patterns is that they generally have a low occupancy: they often represent small sets of items in transactions containing many items and thus may not be truly representative of these transactions. To extract high-quality patterns in real-life applications, this paper extends the occupancy measure to also assess the utility of patterns in transaction databases. We propose an efficient algorithm named high-utility occupancy pattern mining (HUOPM), which considers user preferences in terms of frequency, utility, and occupancy. A novel frequency-utility tree and two compact data structures, called the utility-occupancy list and the frequency-utility table, are designed to provide global and partial downward closure properties for pruning the search space. The proposed method can efficiently discover the complete set of high-quality patterns without candidate generation. Extensive experiments have been conducted on several datasets to evaluate the effectiveness and efficiency of the proposed algorithm. Results show that the derived patterns are intelligible, reasonable, and acceptable, and that HUOPM with its pruning strategies outperforms the state-of-the-art algorithm in both runtime and search space.
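The abstract does not state the formulas, so the following is only a minimal sketch of what a utility-occupancy measure could look like. It assumes that the occupancy of a pattern in a supporting transaction is the share of that transaction's total utility contributed by the pattern's items, and that the pattern's utility occupancy is the mean of those shares. The toy database, the function names, and the thresholds `min_sup` and `min_uo` are hypothetical, and the brute-force enumeration is a stand-in for illustration, not the HUOPM algorithm or its tree and list structures.

```python
from itertools import combinations

# Hypothetical toy database: each transaction maps an item to the utility
# of that item in the transaction (made-up values, not from the paper).
TRANSACTIONS = [
    {"a": 4, "b": 6},
    {"a": 2, "b": 3, "c": 5},
    {"a": 1, "c": 1, "d": 8},
    {"b": 9, "d": 1},
]


def utility_occupancy(pattern, transactions):
    """Return (support, mean utility occupancy) of `pattern`.

    Assumed definition: in each transaction containing the whole pattern,
    occupancy = utility of the pattern's items / total transaction utility.
    """
    shares = []
    for t in transactions:
        if all(item in t for item in pattern):
            shares.append(sum(t[i] for i in pattern) / sum(t.values()))
    if not shares:
        return 0, 0.0
    return len(shares), sum(shares) / len(shares)


def mine_brute_force(transactions, min_sup, min_uo):
    """Enumerate every itemset and keep those meeting both thresholds.

    Exponential in the number of items; usable only on toy data.
    """
    items = sorted({i for t in transactions for i in t})
    result = {}
    for k in range(1, len(items) + 1):
        for pattern in combinations(items, k):
            sup, uo = utility_occupancy(pattern, transactions)
            if sup >= min_sup and uo >= min_uo:
                result[pattern] = (sup, round(uo, 3))
    return result


if __name__ == "__main__":
    print(mine_brute_force(TRANSACTIONS, min_sup=2, min_uo=0.5))
```

On this toy data, `("a", "b")` has a higher mean occupancy than `("a",)` alone, so under the assumed definition the measure is not anti-monotone the way frequency is. That is one plausible reason a method would need dedicated bounds to obtain downward closure properties for pruning.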