Shang Jingbo, Peng Jian, Han Jiawei
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA.
Proc SIAM Int Conf Data Min. 2016 May;2016:558-566. doi: 10.1137/1.9781611974348.63.
Consecutive pattern mining aiming at finding sequential patterns substrings, is a special case of frequent pattern mining and has been played a crucial role in many real world applications, especially in biological sequence analysis, time series analysis, and network log mining. Approximations, including insertions, deletions, and substitutions, between strings are widely used in biological sequence comparisons. However, most existing string pattern mining methods only consider hamming distance without insertions/deletions (indels). Little attention has been paid to the general approximate consecutive frequent pattern mining under edit distance, potentially due to the high computational complexity, particularly on DNA sequences with billions of base pairs. In this paper, we introduce an efficient solution to this problem. We first formulate the Maximal Approximate Consecutive Frequent Pattern Mining (MACFP) problem that identifies substring patterns under edit distance in a long query sequence. Then, we propose a novel algorithm with linear time complexity to check whether the support of a substring pattern is above a predefined threshold in the query sequence, thus greatly reducing the computational complexity of MACFP. With this fast decision algorithm, we can efficiently solve the original pattern discovery problem with several indexing and searching techniques. Comprehensive experiments on sequence pattern analysis and a study on cancer genomics application demonstrate the effectiveness and efficiency of our algorithm, compared to several existing methods.
旨在寻找连续模式子串的连续模式挖掘是频繁模式挖掘的一种特殊情况,并且在许多实际应用中发挥了关键作用,尤其是在生物序列分析、时间序列分析和网络日志挖掘中。字符串之间的近似,包括插入、删除和替换,在生物序列比较中被广泛使用。然而,大多数现有的字符串模式挖掘方法只考虑汉明距离,而不考虑插入/删除(indels)。由于计算复杂度高,特别是对于具有数十亿碱基对的DNA序列,编辑距离下的一般近似连续频繁模式挖掘很少受到关注。在本文中,我们针对这个问题引入了一种有效的解决方案。我们首先提出了最大近似连续频繁模式挖掘(MACFP)问题,该问题在长查询序列中识别编辑距离下的子串模式。然后,我们提出了一种具有线性时间复杂度的新颖算法,用于检查查询序列中子串模式的支持度是否高于预定义阈值,从而大大降低了MACFP的计算复杂度。借助这种快速决策算法,我们可以通过几种索引和搜索技术有效地解决原始模式发现问题。与几种现有方法相比,在序列模式分析方面的综合实验以及对癌症基因组学应用的研究证明了我们算法的有效性和效率。