Rajasekaran Sanguthevar, Dinh Hieu
Department of CSE, University of Connecticut, Storrs, CT 06269, USA.
BMC Res Notes. 2011 Mar 8;4:54. doi: 10.1186/1756-0500-4-54.
The discovery of patterns in DNA, RNA, and protein sequences has led to the solution of many vital biological problems. For instance, the identification of patterns in nucleic acid sequences has resulted in the determination of open reading frames, identification of promoter elements of genes, identification of intron/exon splicing sites, identification of SH RNAs, location of RNA degradation signals, identification of alternative splicing sites, etc. In protein sequences, patterns have proven to be extremely helpful in domain identification, location of protease cleavage sites, identification of signal peptides, protein interactions, determination of protein degradation elements, identification of protein trafficking elements, etc. Motifs are important patterns that are helpful in finding transcriptional regulatory elements, transcription factor binding sites, functional genomics, drug design, etc. As a result, numerous papers have been written to solve the motif search problem.
Three versions of the motif search problem have been proposed in the literature: Simple Motif Search (SMS), (l, d)-motif search (or Planted Motif Search (PMS)), and Edit-distance-based Motif Search (EMS). In this paper we focus on PMS. Two kinds of algorithms can be found in the literature for solving the PMS problem: exact and approximate. An exact algorithm identifies the motifs always and an approximate algorithm may fail to identify some or all of the motifs. The exact version of PMS problem has been shown to be NP-hard. Exact algorithms proposed in the literature for PMS take time that is exponential in some of the underlying parameters. In this paper we propose a generic technique that can be used to speedup PMS algorithms.
We present a speedup technique that can be used on any PMS algorithm. We have tested our speedup technique on a number of algorithms. These experimental results show that our speedup technique is indeed very effective. The implementation of algorithms is freely available on the web at http://www.engr.uconn.edu/rajasek/PMS4.zip.
DNA、RNA和蛋白质序列中模式的发现推动了许多重要生物学问题的解决。例如,核酸序列中模式的识别导致了开放阅读框的确定、基因启动子元件的识别、内含子/外显子剪接位点的识别、短干扰RNA的识别、RNA降解信号的定位、可变剪接位点的识别等。在蛋白质序列中,模式已被证明在结构域识别、蛋白酶切割位点定位、信号肽识别、蛋白质相互作用、蛋白质降解元件确定、蛋白质转运元件识别等方面非常有用。基序是有助于发现转录调控元件、转录因子结合位点、功能基因组学、药物设计等的重要模式。因此,已经有大量论文致力于解决基序搜索问题。
文献中提出了基序搜索问题的三个版本:简单基序搜索(SMS)、(l, d)-基序搜索(或植入基序搜索(PMS))和基于编辑距离的基序搜索(EMS)。在本文中,我们关注PMS。文献中可找到两种用于解决PMS问题的算法:精确算法和近似算法。精确算法总能识别出基序,而近似算法可能无法识别部分或全部基序。PMS问题的精确版本已被证明是NP难的。文献中为PMS提出的精确算法在某些基础参数上花费的时间是指数级的。在本文中,我们提出了一种可用于加速PMS算法的通用技术。
我们提出了一种可用于任何PMS算法的加速技术。我们在多种算法上测试了我们的加速技术。这些实验结果表明我们的加速技术确实非常有效。算法的实现可在网页http://www.engr.uconn.edu/rajasek/PMS4.zip上免费获取。