Rigoutsos I, Floratos A
Computational Biology Center, IBM Thomas J. Watson Research Center, York Town Heights, NY 10598, USA.
Bioinformatics. 1998;14(1):55-67. doi: 10.1093/bioinformatics/14.1.55.
The discovery of motifs in biological sequences is an important problem.
This paper presents a new algorithm for the discovery of rigid patterns (motifs) in biological sequences. Our method is combinatorial in nature and able to produce all patterns that appear in at least a (user-defined) minimum number of sequences, yet it manages to be very efficient by avoiding the enumeration of the entire pattern space. Furthermore, the reported patterns are maximal: any reported pattern cannot be made more specific and still keep on appearing at the exact same positions within the input sequences. The effectiveness of the proposed approach is showcased on a number of test cases which aim to: (i) validate the approach through the discovery of previously reported patterns; (ii) demonstrate the capability to identify automatically highly selective patterns particular to the sequences under consideration. Finally, experimental analysis indicates that the algorithm is output sensitive, i.e. its running time is quasi-linear to the size of the generated output.
在生物序列中发现基序是一个重要问题。
本文提出了一种用于在生物序列中发现刚性模式(基序)的新算法。我们的方法本质上是组合式的,能够生成在至少(用户定义的)最小数量序列中出现的所有模式,并且通过避免枚举整个模式空间设法做到非常高效。此外,所报告的模式是最大的:任何所报告的模式都不能变得更具体,并且仍然在输入序列中的相同位置出现。在一些测试用例上展示了所提出方法的有效性,这些测试用例旨在:(i)通过发现先前报告的模式来验证该方法;(ii)展示识别所考虑序列特有的自动高度选择性模式的能力。最后,实验分析表明该算法对输出敏感,即其运行时间与生成的输出大小近似呈线性关系。