Lippert Ross A, Mobarry Clark M, Walenz Brian P
Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
J Comput Biol. 2005 Sep;12(7):943-51. doi: 10.1089/cmb.2005.12.943.
Algorithms for exact string matching have substantial application in computational biology. Time-efficient data structures which support a variety of exact string matching queries, such as the suffix tree and the suffix array, have been applied to such problems. As sequence databases grow, more space-efficient approaches to exact matching are becoming more important. One such data structure, the compressed suffix array (CSA), based on the Burrows-Wheeler transform, has been shown to require memory which is nearly equal to the memory requirements of the original database, while supporting common sorts of query problems time efficiently. However, building a CSA from a sequence in efficient space and time is challenging. In 2002, the first space-efficient CSA construction algorithm was presented. That implementation used (1+2 log2 |summation|)(1+epsilon) bits per character (where epsilon is a small fraction). The construction algorithm ran in as much as twice that space, in O(| summation|n log(n)) time. We have created an implementation which can also achieve these asymptotic bounds, but for small alphabets, and only uses 1/2 (1+|summation|)(1+epsilon) bits per character, a factor of 2 less space for nucleotide alphabets. We present time and space results for the CSA construction and querying of our implementation on publicly available genome data which demonstrate the practicality of this approach.
精确字符串匹配算法在计算生物学中有着广泛的应用。支持各种精确字符串匹配查询的高效时间数据结构,如后缀树和后缀数组,已被应用于此类问题。随着序列数据库的增长,更节省空间的精确匹配方法变得越来越重要。一种这样的数据结构,即基于Burrows-Wheeler变换的压缩后缀数组(CSA),已被证明所需内存几乎与原始数据库的内存需求相等,同时能高效地支持常见类型的查询问题。然而,在高效的空间和时间内从序列构建CSA具有挑战性。2002年,提出了第一种节省空间的CSA构建算法。该实现每个字符使用(1 + 2 log2 |求和|)(1 + ε)位(其中ε是一个小分数)。构建算法在多达两倍的空间内运行,时间复杂度为O(|求和|n log(n))。我们创建了一个实现,它也能达到这些渐近界,但对于小字母表,每个字符仅使用1/2 (1 + |求和|)(1 + ε)位,对于核苷酸字母表,空间减少了一半。我们展示了我们在公开可用基因组数据上实现CSA构建和查询的时间和空间结果,证明了这种方法的实用性。