Mendes Nuno D, Casimiro Ana C, Santos Pedro M, Sá-Correia Isabel, Oliveira Arlindo L, Freitas Ana T
INESC-ID, Instituto Superior Técnico, Rua Alves Redol 9 1000-029 Lisboa, Portugal.
Bioinformatics. 2006 Dec 15;22(24):2996-3002. doi: 10.1093/bioinformatics/btl537. Epub 2006 Oct 26.
The ability to identify complex motifs, i.e. non-contiguous nucleotide sequences, is a key feature of modern motif finders. Addressing this problem is extremely important, not only because these motifs can accurately model biological phenomena but also because their extraction is highly dependent upon the appropriate selection of numerous search parameters. Currently available combinatorial algorithms have proved to be highly efficient in exhaustively enumerating motifs (including complex motifs) that fulfill certain extraction criteria. However, one major problem with these methods is the large number of parameters that need to be specified.
We propose a new algorithm, MUSA (Motif finding using an UnSupervised Approach), that can be used either to autonomously find over-represented complex motifs or to estimate search parameters for modern motif finders. This method relies on a biclustering algorithm that operates on a matrix of co-occurrences of small motifs. The method makes few assumptions about the characteristics of the motifs being sought, and its performance is independent of their composite structure. The MUSA algorithm was applied to two datasets involving the bacterium Pseudomonas putida KT2440. The first comprised 70 σ54-dependent promoter sequences, and the second comprised 54 promoter sequences of genes up-regulated in response to phenol, as suggested by quantitative proteomics. The results obtained indicate that this approach is very effective at identifying complex motifs of biological significance.
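To make the idea of a matrix of co-occurrences of small motifs concrete, the following is a minimal sketch, not the MUSA implementation. It assumes that a "small motif" is a fixed-length k-mer and that two k-mers co-occur when both appear anywhere in the same sequence; the abstract does not state how MUSA defines either. The function names `kmers` and `cooccurrence_counts` are hypothetical, and the biclustering step is omitted.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cooccurrence_counts(sequences, k):
    """Count, for each unordered pair of k-mers, the sequences containing both.

    Assumption: co-occurrence is measured per sequence, ignoring position
    and spacing. Keys are (a, b) tuples with a < b lexicographically.
    """
    counts = defaultdict(int)
    for seq in sequences:
        present = sorted(kmers(seq.upper(), k))
        for a, b in combinations(present, 2):
            counts[(a, b)] += 1
    return dict(counts)


if __name__ == "__main__":
    # Toy sequences for illustration only, not data from the paper.
    toy = ["ACGT", "ACGA", "TTTT"]
    for pair, n in sorted(cooccurrence_counts(toy, 2).items()):
        print(pair, n)
```

A biclustering algorithm would then look for subsets of small motifs that co-occur consistently across a subset of the matrix; which biclustering algorithm MUSA uses is not specified in this abstract.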
The MUSA algorithm is available upon request from the authors and will be made available via a Web-based interface.