Navarro Osvaldo, Cumplido René, Villaseñor-Pineda Luis, Feregrino-Uribe Claudia, Carrasco-Ochoa Jesús Ariel
Departamento de Ciencias Computacionales, Instituto Nacional de Astrofísica, Óptica y Electrónica, Sta. Ma. Tonantzintla, Puebla, México.
PLoS One. 2014 Jun 16;9(6):e95418. doi: 10.1371/journal.pone.0095418. eCollection 2014.
Sequential Pattern Mining is a widely addressed problem in data mining, with applications such as analyzing Web usage, examining purchase behavior, and text mining, among others. Nevertheless, with the dramatic increase in data volume, the current approaches prove inefficient when dealing with large input datasets, a large number of different symbols and low minimum supports. In this paper, we propose a new sequential pattern mining algorithm, which follows a pattern-growth scheme to discover sequential patterns. Unlike most pattern growth algorithms, our approach does not build a data structure to represent the input dataset, but instead accesses the required sequences through pseudo-projection databases, achieving better runtime and reducing memory requirements. Our algorithm traverses the search space in a depth-first fashion and only preserves in memory a pattern node linkage and the pseudo-projections required for the branch being explored at the time. Experimental results show that our new approach, the Node Linkage Depth-First Traversal algorithm (NLDFT), has better performance and scalability in comparison with state of the art algorithms.
序列模式挖掘是数据挖掘中一个被广泛研究的问题,其应用包括分析网页使用情况、研究购买行为以及文本挖掘等。然而,随着数据量的急剧增加,当前的方法在处理大型输入数据集、大量不同符号和低最小支持度时效率低下。在本文中,我们提出了一种新的序列模式挖掘算法,该算法遵循模式增长方案来发现序列模式。与大多数模式增长算法不同,我们的方法不构建数据结构来表示输入数据集,而是通过伪投影数据库访问所需序列,从而实现更好的运行时性能并降低内存需求。我们的算法以深度优先的方式遍历搜索空间,并且仅在内存中保留模式节点链接和当时正在探索的分支所需的伪投影。实验结果表明,我们的新方法,即节点链接深度优先遍历算法(NLDFT),与现有算法相比具有更好的性能和可扩展性。