Putonti C, Pettitt BM, Reid JG, Fofanov Y
Department of Computer Science, University of Houston, Houston, Texas, USA.
Online J Bioinform. 2007 Jan 1;8(1):30-40.
Algorithms for motif identification in sequence space have predominantly focused on recognizing patterns of a fixed length that contain regions of perfect conservation together with possible regions of unconstrained sequence. Such motifs are found in everything from proteins with distinct active sites to non-coding RNAs with specific structural elements necessary to maintain functionality. If an insertion/deletion occurs within an unconstrained portion of a pattern, the pattern may still retain its functionality. In that case the length of the pattern becomes variable, and its occurrences may be overlooked by existing motif detection methods. The Pattern Island Detection Algorithm (PIDA) presented here was developed to recognize patterns whose occurrences vary in length, within sequences over an alphabet of any size. PIDA first identifies all regions of perfect conservation (of length longer than a user-specified threshold) and then builds those conservation "islands" into fixed-length patterns. Next, the algorithm modifies these fixed-length patterns by identifying additional (and different) islands that can be incorporated into each pattern through insertions/deletions within the "water" separating the islands. To provide benchmarks for this analysis, PIDA was used to search for patterns within randomly generated sequences as well as sequences known to contain conserved patterns. For each pattern found, the statistical significance is calculated from the pattern's likelihood of appearing by chance, providing a means to determine which patterns are likely to have a functional role. The PIDA approach to motif finding is designed to perform best when searching for patterns of variable length, although it can also identify patterns of a fixed length. PIDA has been made as generally applicable as possible, since a variety of sequence problems are of this type.
The algorithm was implemented in C++ and is freely available upon request from the authors.
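The island-and-water idea described above can be illustrated with a minimal Python sketch. It is not the authors' C++ implementation. As simplifying assumptions, islands are taken to be exact substrings of one fixed length `min_len` shared by every input sequence, and a pattern is an ordered pair of islands whose separating "water" may differ in length from sequence to sequence, up to `max_gap`. The names `find_islands`, `pair_islands`, `min_len`, and `max_gap` are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict


def find_islands(seqs, min_len):
    """Map each substring of length min_len that occurs in every
    sequence to its start positions, one list per sequence."""
    index = defaultdict(lambda: [[] for _ in seqs])
    for i, seq in enumerate(seqs):
        for start in range(len(seq) - min_len + 1):
            index[seq[start:start + min_len]][i].append(start)
    # Keep only substrings present in all sequences (perfect conservation).
    return {word: pos for word, pos in index.items() if all(pos)}


def pair_islands(islands, min_len, max_gap):
    """Join ordered island pairs (a, b) where b follows a in every
    sequence with 0..max_gap characters of "water" in between.
    Returns {(a, b): [smallest gap in each sequence]}."""
    patterns = {}
    for a, pos_a in islands.items():
        for b, pos_b in islands.items():
            gaps_per_seq = []
            for starts_a, starts_b in zip(pos_a, pos_b):
                gaps = [q - (p + min_len)
                        for p in starts_a for q in starts_b
                        if 0 <= q - (p + min_len) <= max_gap]
                if not gaps:
                    break  # pair is absent from this sequence
                gaps_per_seq.append(min(gaps))
            else:
                patterns[(a, b)] = gaps_per_seq
    return patterns


# Toy input: two conserved islands separated by water of varying length.
seqs = ["TTACGTAAGGCATT", "CCACGTAGGCACC", "GACGTAAAAGGCAG"]
islands = find_islands(seqs, min_len=4)
patterns = pair_islands(islands, min_len=4, max_gap=5)
print(patterns[("ACGT", "GGCA")])  # gaps per sequence: [2, 1, 4]
```

In this toy input the islands `ACGT` and `GGCA` are separated by 2, 1, and 4 characters respectively, so a search restricted to fixed-length patterns would not report them as one motif. The sketch does not merge overlapping islands into maximal ones and does not compute statistical significance, both of which the abstract attributes to PIDA.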