Eleazar Eskin, Pavel A. Pevzner
Department of Computer Science, Columbia University, New York, NY 10027, USA.
Bioinformatics. 2002;18 Suppl 1:S354-63. doi: 10.1093/bioinformatics/18.suppl_1.s354.
Pattern discovery in unaligned DNA sequences is a fundamental problem in computational biology with important applications in finding regulatory signals. Current approaches to pattern discovery focus on monad patterns that correspond to relatively short contiguous strings. However, many of the actual regulatory signals are composite patterns that are groups of monad patterns that occur near each other. A difficulty in discovering composite patterns is that one or both of the component monad patterns in the group may be 'too weak'. Since the traditional monad-based motif finding algorithms usually output one (or a few) high scoring patterns, they often fail to find composite regulatory signals consisting of weak monad parts. In this paper, we present a MITRA (MIsmatch TRee Algorithm) approach for discovering composite signals. We demonstrate that MITRA performs well for both monad and composite patterns by presenting experiments over biological and synthetic data.
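To make the monad/composite distinction concrete, here is a minimal brute-force sketch in Python. It is not the MITRA mismatch tree itself, which the abstract does not specify; it only illustrates what it means for two monad patterns to occur near each other, each allowing a few mismatches. The function names, the mismatch budget `d`, and the gap bounds `min_gap` and `max_gap` are all assumptions introduced for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def monad_positions(seq, pattern, d):
    """Start indices where `pattern` occurs in `seq` with at most d mismatches."""
    l = len(pattern)
    return [i for i in range(len(seq) - l + 1)
            if hamming(seq[i:i + l], pattern) <= d]


def has_dyad(seq, left, right, d, min_gap, max_gap):
    """True if `left` is followed by `right` with a gap in [min_gap, max_gap].

    Each monad part may carry up to d mismatches (hypothetical parameters).
    """
    right_starts = set(monad_positions(seq, right, d))
    for i in monad_positions(seq, left, d):
        end = i + len(left)  # first index after the left monad
        if any(end + g in right_starts for g in range(min_gap, max_gap + 1)):
            return True
    return False


def support(seqs, left, right, d, min_gap, max_gap):
    """Number of sequences that contain the composite (dyad) pattern."""
    return sum(has_dyad(s, left, right, d, min_gap, max_gap) for s in seqs)


# Toy sequences, invented for this example only.
seqs = ["TTACGTAAAGGCATT", "ACCTCCGGAACC", "TTTTTTTTTTTT"]
n_with_dyad = support(seqs, "ACGT", "GGCA", d=1, min_gap=2, max_gap=4)
```

Scoring the pair jointly, rather than each monad on its own, is the point of the composite view: a part that is too weak to stand out alone can still be found when its partner appears at a constrained distance.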