Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA.
BMC Bioinformatics. 2013 May 16;14:161. doi: 10.1186/1471-2105-14-161.
Motifs are significant patterns in DNA, RNA, and protein sequences, which play an important role in biological processes and functions, like identification of open reading frames, RNA transcription, protein binding, etc. Several versions of the motif search problem have been studied in the literature. One such version is called the Planted Motif Search (PMS)or (l, d)-motif Search. PMS is known to be NP complete. The time complexities of most of the planted motif search algorithms depend exponentially on the alphabet size. Recently a new version of the motif search problem has been introduced by Kuksa and Pavlovic. We call this version as the Motif Stems Search (MSS) problem. A motif stem is an l-mer (for some relevant value of l)with some wildcard characters and hence corresponds to a set of l-mers (without wildcards), some of which are (l, d)-motifs. Kuksa and Pavlovic have presented an efficient algorithm to find motif stems for inputs from large alphabets. Ideally, the number of stems output should be as small as possible since the stems form a superset of the motifs.
In this paper we propose an efficient algorithm for MSS and evaluate it on both synthetic and real data. This evaluation reveals that our algorithm is much faster than Kuksa and Pavlovic's algorithm.
Our MSS algorithm outperforms the algorithm of Kuksa and Pavlovic in terms of the run time as well as the number of stems output. Specifically, the stems output by our algorithm form a proper (and much smaller)subset of the stems output by Kuksa and Pavlovic's algorithm.
基序是 DNA、RNA 和蛋白质序列中的重要模式,在生物过程和功能中发挥着重要作用,例如识别开放阅读框、RNA 转录、蛋白质结合等。文献中已经研究了几种基序搜索问题的版本。其中一个版本称为“已种植基序搜索(PMS)”或(l,d)-基序搜索。PMS 是 NP 完全问题。大多数已种植基序搜索算法的时间复杂度都与字母表的大小呈指数级相关。最近,Kuksa 和 Pavlovic 引入了一个新的基序搜索问题版本。我们称之为“基序茎搜索(MSS)”问题。基序茎是一个 l-mer(对于某些相关的 l 值),带有一些通配符字符,因此对应于一组 l-mers(没有通配符),其中一些是(l,d)-基序。Kuksa 和 Pavlovic 提出了一种有效的算法来为来自大字母表的输入找到基序茎。理想情况下,输出的茎数应尽可能少,因为茎形成基序的超集。
在本文中,我们提出了一种有效的 MSS 算法,并在合成数据和真实数据上对其进行了评估。该评估表明,我们的算法比 Kuksa 和 Pavlovic 的算法快得多。
我们的 MSS 算法在运行时间和输出茎数方面都优于 Kuksa 和 Pavlovic 的算法。具体来说,我们的算法输出的茎形成了 Kuksa 和 Pavlovic 算法输出的茎的适当(且小得多)子集。