Kempa Dominik, Kociumaka Tomasz
Stony Brook University.
Max Planck Institute for Informatics.
Proc Annu ACM SIAM Symp Discret Algorithms. 2023;2023:5122-5202. doi: 10.1137/1.9781611977554.ch187.
The suffix array, describing the lexicographical order of suffixes of a given text, and the suffix tree, a path-compressed trie of all suffixes, are the two most fundamental data structures for string processing, with plethora of applications in data compression, bioinformatics, and information retrieval. For a length- text, however, they use bits of space, which is often too costly. To address this, Grossi and Vitter [STOC 2000] and, independently, Ferragina and Manzini [FOCS 2000] introduced space-efficient versions of the suffix array, known as the (CSA) and the . Sadakane [SODA 2002] then showed how to augment them to obtain the (CST). For a length- text over an alphabet of size , these structures use only bits. Nowadays, these structures are part of the standard toolbox: modern textbooks spend dozens of pages describing their applications, and they almost completely replaced suffix arrays and suffix trees in space-critical applications. The biggest remaining open question is how efficiently they can be constructed. After two decades, the fastest algorithms still run in time [Hon et al., FOCS 2003], which is factor away from the lower bound of (following from the necessity to read the input). In this paper, we make the first in 20 years improvement in for this problem by proposing a new compressed suffix array and a new compressed suffix tree which admit -time construction algorithms while matching the space bounds and the query times of the original CSA/CST and the FM-index. More precisely, our structures take bits, support SA queries and full suffix tree functionality in time per operation, and can be constructed in time using bits of working space. (For example, if , the construction time is .) We derive this result as a corollary from a much more general reduction: We prove that all parameters of a compressed suffix array/tree (query time, space, construction time, and construction working space) can essentially be reduced to those of a data structure answering new query types that we call and . Using the novel techniques, we also develop a new index for pattern matching.
后缀数组描述给定文本后缀的字典序,后缀树则是所有后缀的路径压缩前缀树,它们是字符串处理中最基本的两种数据结构,在数据压缩、生物信息学和信息检索等领域有大量应用。然而,对于长度为 的文本,它们需要 位空间,这通常成本过高。为了解决这个问题,格罗西和维特 [《ACM 计算理论年会论文集》2000 年] 以及独立地,费拉吉纳和曼齐尼 [《计算机基础科学年会论文集》2000 年] 引入了后缀数组的空间高效版本,即 (压缩后缀数组)和 。笹钟根 [《离散算法年会论文集》2002 年] 随后展示了如何对它们进行扩充以得到 (压缩后缀树)。对于长度为 且字母表大小为 的文本,这些结构仅使用 位。如今,这些结构已成为标准工具箱的一部分:现代教科书用几十页描述它们的应用,并且在对空间要求严格的应用中它们几乎完全取代了后缀数组和后缀树。剩下的最大开放性问题是它们能以多高效的方式构建。二十年后,最快的算法仍然需要 时间运行 [洪等人,《计算机基础科学年会论文集》2003 年],这比 的下界(由于必须读取输入)慢了 倍。在本文中,我们通过提出一种新的压缩后缀数组和一种新的压缩后缀树,在 20 年来首次对这个问题的 做出改进,这两种结构允许 时间的构建算法,同时匹配原始压缩后缀数组/压缩后缀树和 FM 索引的空间界限及查询时间。更准确地说,我们的结构占用 位,每次操作在 时间内支持后缀数组查询和完整后缀树功能,并且可以使用 位工作空间在 时间内构建。(例如,如果 ,构建时间为 。)我们将此结果作为一个更一般的化简的推论得出:我们证明压缩后缀数组/树的所有参数(查询时间、空间、构建时间和构建工作空间)本质上都可以简化为一个数据结构的参数,该数据结构用于回答我们称为 和 的新查询类型。使用这些新技术,我们还开发了一种用于模式匹配的新索引。