Zhang Jingsong, Wang Yinglin, Zhang Chao, Shi Yongyong
IEEE/ACM Trans Comput Biol Bioinform. 2016 Sep-Oct;13(5):855-867. doi: 10.1109/TCBB.2015.2495132. Epub 2015 Oct 26.
The discovery of conserved sequential patterns in biological sequences is essential to unveiling common shared functions. Mining sequential generators as well as mining closed sequential patterns can contribute to a more concise result set than mining all sequential patterns, especially in the analysis of big data in bioinformatics. Previous studies have also presented convincing arguments that the generator is preferable to the closed pattern in inductive inference and classification. However, classic sequential generator mining algorithms, due to the lack of consideration on the contiguous constraint along with the lower-closed one, still pose a great challenge at spawning a large number of inefficient and redundant patterns, which is too huge for effective usage. Driven by some extensive applications of patterns with contiguous feature, we propose ConSgen, an efficient algorithm for discovering contiguous sequential generators. It adopts the n-gram model, called shingles, to generate potential frequent subsequences and leverages several pruning techniques to prune the unpromising parts of search space. And then, the contiguous sequential generators are identified by using the equivalence class-based lower-closure checking scheme. Our experiments on both DNA and protein data sets demonstrate the compactness, efficiency, and scalability of ConSgen.
在生物序列中发现保守的序列模式对于揭示共同的共享功能至关重要。挖掘序列生成器以及挖掘封闭序列模式比挖掘所有序列模式能产生更简洁的结果集,特别是在生物信息学中的大数据分析中。先前的研究也提出了令人信服的论据,即在归纳推理和分类中,生成器比封闭模式更可取。然而,经典的序列生成器挖掘算法由于缺乏对连续约束以及下封闭约束的考虑,在产生大量低效和冗余模式方面仍然面临巨大挑战,这些模式数量太多以至于无法有效使用。受具有连续特征的模式的一些广泛应用的驱动,我们提出了ConSgen,一种用于发现连续序列生成器的高效算法。它采用称为单字符组的n元语法模型来生成潜在的频繁子序列,并利用几种剪枝技术来修剪搜索空间中没有前景的部分。然后,通过使用基于等价类的下封闭检查方案来识别连续序列生成器。我们在DNA和蛋白质数据集上的实验证明了ConSgen的紧凑性、效率和可扩展性。