Liao Vance, Chen Ming-Syan
BMC Syst Biol. 2013;7 Suppl 4(Suppl 4):S7. doi: 10.1186/1752-0509-7-S4-S7. Epub 2013 Oct 23.
Pattern mining for biological sequences is an important problem in bioinformatics and computational biology. Biological data mining has an impact in diverse biological fields; one example is the discovery of co-occurring biosequences, which is important for biological data analysis. Sequential pattern mining approaches can discover motifs of all lengths in biological sequences. Nevertheless, traditional sequential pattern mining approaches are inefficient on DNA and protein data, because such data have small alphabets and long sequences. Furthermore, gap constraints are important in computational biology because they accommodate irrelevant regions, which are not conserved during the evolution of biological sequences.
We devise an approach to efficiently mine sequential patterns (motifs) with gap constraints in biological sequences. The approach is the Depth-First Spelling algorithm for mining sequential patterns of biological sequences with Gap constraints (termed DFSG).
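To make the idea of gap-constrained pattern growth concrete, the following is a minimal sketch of depth-first mining with gap constraints. It is not the DFSG algorithm itself, whose details are not given here. The function name `mine_gapped_patterns` and its parameters (`min_support`, `min_gap`, `max_gap`, `max_length`) are illustrative assumptions. It also assumes that a gap is the number of unmatched positions between two consecutive pattern symbols, and that support is the number of sequences containing at least one occurrence.

```python
from collections import defaultdict


def mine_gapped_patterns(sequences, min_support, min_gap, max_gap, max_length):
    """Return {pattern: support} for gap-constrained sequential patterns.

    Illustrative sketch only (all names are hypothetical). Consecutive
    symbols of a pattern must be separated by min_gap..max_gap positions.
    """
    results = {}

    def extend(pattern, ends):
        # ends maps a sequence index to the set of positions at which
        # some occurrence of `pattern` ends in that sequence.
        if len(pattern) >= max_length:
            return
        candidates = defaultdict(lambda: defaultdict(set))
        for sid, positions in ends.items():
            seq = sequences[sid]
            for end in positions:
                lo = end + 1 + min_gap
                hi = min(end + 1 + max_gap, len(seq) - 1)
                for nxt in range(lo, hi + 1):
                    candidates[seq[nxt]][sid].add(nxt)
        for symbol in sorted(candidates):
            new_ends = candidates[symbol]
            # Support of a pattern never exceeds that of its prefix,
            # so infrequent extensions can be pruned immediately.
            if len(new_ends) >= min_support:
                new_pattern = pattern + symbol
                results[new_pattern] = len(new_ends)
                extend(new_pattern, new_ends)  # go deeper first

    # Seed the search with every frequent single symbol.
    seeds = defaultdict(lambda: defaultdict(set))
    for sid, seq in enumerate(sequences):
        for pos, symbol in enumerate(seq):
            seeds[symbol][sid].add(pos)
    for symbol in sorted(seeds):
        if len(seeds[symbol]) >= min_support:
            results[symbol] = len(seeds[symbol])
            extend(symbol, seeds[symbol])
    return results


if __name__ == "__main__":
    toy = ["ACGT", "AGCT", "ATTC"]
    print(mine_gapped_patterns(toy, min_support=2, min_gap=0, max_gap=1, max_length=3))
```

On the toy input above, the pattern `AC` is supported by the first two sequences when up to one position may be skipped. With `max_gap=0` it is supported only by the first sequence and is pruned. Tracking only end positions keeps each extension step local to a small window, which matters when the alphabet is small and the sequences are long.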
PrefixSpan is one of the most efficient traditional approaches to mining sequential patterns, and it is the basis of GenPrefixSpan, which extends PrefixSpan with gap constraints. We therefore compare DFSG with GenPrefixSpan. In the experiments, DFSG mines biological sequences much faster than GenPrefixSpan.