Wijaya Edward, Rajaraman Kanagasabai, Yiu Siu-Ming, Sung Wing-Kin
Institute for Infocomm Research, Singapore.
Bioinformatics. 2007 Jun 15;23(12):1476-85. doi: 10.1093/bioinformatics/btm118. Epub 2007 May 5.
Identification of motifs is one of the critical stages in studying the regulatory interactions of genes. Motifs can have complicated patterns. In particular, spaced motifs, an important class of motifs, consist of several short segments separated by spacers of different lengths. Locating spaced motifs is not trivial. Existing motif-finding algorithms are designed for monad motifs (short contiguous patterns with some mismatches), make assumptions about the spacer lengths, or can handle at most two segments. An effective motif finder for generic spaced motifs is highly desirable.
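To make the definition of a spaced motif concrete, the following is a minimal sketch of checking where such a motif occurs in a sequence. It is an illustration only and is not the SPACE algorithm. The function name `find_spaced_motif`, the representation of a motif as a list of segments plus a list of `(min, max)` spacer-length ranges, and the per-segment mismatch budget are all assumptions made for this example.

```python
from typing import List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_spaced_motif(
    seq: str,
    segments: List[str],
    spacers: List[Tuple[int, int]],
    max_mismatch: int = 0,
) -> List[int]:
    """Return start positions where the segments occur in order.

    spacers[i] is the allowed (min, max) gap length between segment i and
    segment i + 1 (hypothetical representation). Each segment may differ
    from the sequence in at most max_mismatch positions.
    """
    def match_from(pos: int, idx: int) -> bool:
        seg = segments[idx]
        end = pos + len(seg)
        # The segment must fit in the sequence and be close enough to it.
        if end > len(seq) or hamming(seq[pos:end], seg) > max_mismatch:
            return False
        if idx == len(segments) - 1:
            return True
        lo, hi = spacers[idx]
        # Try every allowed spacer length before the next segment.
        return any(match_from(end + gap, idx + 1) for gap in range(lo, hi + 1))

    return [i for i in range(len(seq)) if match_from(i, 0)]


# Two segments separated by a spacer of 3 to 5 bases.
hits = find_spaced_motif(
    "GGTTGACACCCCTATAATGG", ["TTGACA", "TATAAT"], [(3, 5)]
)
print(hits)
```

A monad motif is the special case with a single segment and an empty spacer list. A search like this only verifies a motif that is already known; discovering the segments and spacer ranges from unlabeled sequences is the harder problem the article addresses.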
This article proposes a novel approach for identifying spaced motifs with any number of spacers of different lengths. We introduce the notion of submotifs to capture the segments in the spaced motif and formulate the motif-finding problem as a frequent submotif mining problem. We provide an algorithm called SPACE to solve the problem. Based on experiments on real biological datasets, synthetic datasets and the motif assessment benchmarks by Tompa et al., we show that our algorithm performs better than existing tools for spaced motifs, with improvements in both sensitivity and specificity. For monads, SPACE performs as well as other tools.
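The general idea behind frequent-pattern mining can be sketched as support counting: keep the short segments that appear in at least a minimum number of input sequences. The abstract does not specify how submotifs are defined or mined, so the sketch below is not the SPACE algorithm. It assumes exact, fixed-length segments, and the names `frequent_segments`, `k` and `min_support` are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def frequent_segments(
    sequences: List[str], k: int, min_support: int
) -> Dict[str, int]:
    """Return length-k segments found in at least min_support sequences.

    Support is the number of sequences containing the segment, so repeated
    occurrences inside one sequence are counted once.
    """
    support: Counter = Counter()
    for seq in sequences:
        # Collect the distinct length-k substrings of this sequence.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        support.update(seen)
    return {seg: n for seg, n in support.items() if n >= min_support}


# "ACG" is the only 3-mer shared by all three sequences.
print(frequent_segments(["ACGTAC", "TTACGT", "GGGACG"], k=3, min_support=3))
```

A practical motif finder would also need to tolerate mismatches and to combine frequent segments across variable-length spacers, neither of which this sketch attempts.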
The source code is available upon request from the authors.