College of Information Engineering, Yangzhou University, Yangzhou, Jiangsu 225009, China; National Key Lab of Novel Software Tech, Nanjing University, Nanjing 210093, China.
Comput Biol Med. 2013 Oct;43(10):1444-52. doi: 10.1016/j.compbiomed.2013.07.009. Epub 2013 Jul 27.
Existing algorithms for mining frequent patterns in multiple biosequences may generate multiple projected databases and many short candidate patterns, which increases computation time and memory requirements. To overcome these shortcomings, we propose a fast and efficient algorithm for mining frequent patterns in multiple biological sequences (MSPM). We first introduce the concept of a primary pattern, which can be extended to form larger patterns in the sequence. To detect frequent primary patterns, a prefix tree is constructed. Based on this prefix tree, we present a pattern-extending approach that mines frequent patterns without producing a large number of irrelevant candidate patterns. The experimental results show that the MSPM algorithm is faster than other methods and also produces higher-quality results.
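The two stages named in the abstract (a prefix tree for frequent primary patterns, followed by pattern extension) can be illustrated with a minimal Python sketch. This is not the published MSPM procedure, whose details the abstract does not give. It assumes that a primary pattern is a contiguous substring of fixed length `k`, that support is the number of sequences containing the pattern, and that extension joins two frequent patterns overlapping in all but one character. The function names, the `"$ids"` marker key, and the toy sequences are all hypothetical.

```python
def build_prefix_tree(sequences, k):
    """Insert every length-k substring into a trie (nested dicts).

    The node reached after k characters stores, under the marker key
    "$ids", the set of sequence indices containing that substring.
    """
    root = {}
    for sid, seq in enumerate(sequences):
        for i in range(len(seq) - k + 1):
            node = root
            for ch in seq[i:i + k]:
                node = node.setdefault(ch, {})
            node.setdefault("$ids", set()).add(sid)
    return root


def frequent_primary_patterns(root, min_support):
    """Walk the trie; return {pattern: sequence ids} with enough support."""
    result = {}
    stack = [("", root)]
    while stack:
        prefix, node = stack.pop()
        for key, child in node.items():
            if key == "$ids":
                if len(child) >= min_support:
                    result[prefix] = child
            else:
                stack.append((prefix + key, child))
    return result


def extend_patterns(sequences, primary, min_support):
    """Grow frequent patterns one character at a time (assumed join rule).

    Two frequent patterns p and q of equal length are joined when the
    suffix of p equals the prefix of q, so candidates are built only from
    patterns already known to be frequent.
    """
    all_frequent = dict(primary)
    current = primary
    while current:
        longer = {}
        for p, ids_p in current.items():
            for q, ids_q in current.items():
                if p[1:] == q[:-1]:
                    cand = p + q[-1]
                    # Only sequences containing both parts can contain cand.
                    ids = {s for s in ids_p & ids_q if cand in sequences[s]}
                    if len(ids) >= min_support:
                        longer[cand] = ids
        all_frequent.update(longer)
        current = longer
    return all_frequent


# Toy DNA sequences, made up for illustration only.
seqs = ["ACGTAC", "TACGT", "GGACG"]
tree = build_prefix_tree(seqs, k=2)
primary = frequent_primary_patterns(tree, min_support=3)
patterns = extend_patterns(seqs, primary, min_support=3)
print(sorted(patterns))  # ['AC', 'ACG', 'CG']
```

Because each candidate is assembled from two patterns that are already frequent, an infrequent short pattern is never extended. This is one simple way to avoid irrelevant candidates. The sketch makes no claim about the running time or result quality reported for MSPM.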