Suppr超能文献

突破压缩后缀数组和后缀树构建中的-障碍

Breaking the -Barrier in the Construction of Compressed Suffix Arrays and Suffix Trees.

作者信息

Kempa Dominik, Kociumaka Tomasz

机构信息

Stony Brook University.

Max Planck Institute for Informatics.

出版信息

Proc Annu ACM SIAM Symp Discret Algorithms. 2023;2023:5122-5202. doi: 10.1137/1.9781611977554.ch187.

Abstract

The suffix array, describing the lexicographical order of suffixes of a given text, and the suffix tree, a path-compressed trie of all suffixes, are the two most fundamental data structures for string processing, with plethora of applications in data compression, bioinformatics, and information retrieval. For a length- text, however, they use bits of space, which is often too costly. To address this, Grossi and Vitter [STOC 2000] and, independently, Ferragina and Manzini [FOCS 2000] introduced space-efficient versions of the suffix array, known as the (CSA) and the . Sadakane [SODA 2002] then showed how to augment them to obtain the (CST). For a length- text over an alphabet of size , these structures use only bits. Nowadays, these structures are part of the standard toolbox: modern textbooks spend dozens of pages describing their applications, and they almost completely replaced suffix arrays and suffix trees in space-critical applications. The biggest remaining open question is how efficiently they can be constructed. After two decades, the fastest algorithms still run in time [Hon et al., FOCS 2003], which is factor away from the lower bound of (following from the necessity to read the input). In this paper, we make the first in 20 years improvement in for this problem by proposing a new compressed suffix array and a new compressed suffix tree which admit -time construction algorithms while matching the space bounds and the query times of the original CSA/CST and the FM-index. More precisely, our structures take bits, support SA queries and full suffix tree functionality in time per operation, and can be constructed in time using bits of working space. (For example, if , the construction time is .) We derive this result as a corollary from a much more general reduction: We prove that all parameters of a compressed suffix array/tree (query time, space, construction time, and construction working space) can essentially be reduced to those of a data structure answering new query types that we call and . Using the novel techniques, we also develop a new index for pattern matching.

摘要

后缀数组描述给定文本后缀的字典序,后缀树则是所有后缀的路径压缩前缀树,它们是字符串处理中最基本的两种数据结构,在数据压缩、生物信息学和信息检索等领域有大量应用。然而,对于长度为 的文本,它们需要 位空间,这通常成本过高。为了解决这个问题,格罗西和维特 [《ACM 计算理论年会论文集》2000 年] 以及独立地,费拉吉纳和曼齐尼 [《计算机基础科学年会论文集》2000 年] 引入了后缀数组的空间高效版本,即 (压缩后缀数组)和 。笹钟根 [《离散算法年会论文集》2002 年] 随后展示了如何对它们进行扩充以得到 (压缩后缀树)。对于长度为 且字母表大小为 的文本,这些结构仅使用 位。如今,这些结构已成为标准工具箱的一部分:现代教科书用几十页描述它们的应用,并且在对空间要求严格的应用中它们几乎完全取代了后缀数组和后缀树。剩下的最大开放性问题是它们能以多高效的方式构建。二十年后,最快的算法仍然需要 时间运行 [洪等人,《计算机基础科学年会论文集》2003 年],这比 的下界(由于必须读取输入)慢了 倍。在本文中,我们通过提出一种新的压缩后缀数组和一种新的压缩后缀树,在 20 年来首次对这个问题的 做出改进,这两种结构允许 时间的构建算法,同时匹配原始压缩后缀数组/压缩后缀树和 FM 索引的空间界限及查询时间。更准确地说,我们的结构占用 位,每次操作在 时间内支持后缀数组查询和完整后缀树功能,并且可以使用 位工作空间在 时间内构建。(例如,如果 ,构建时间为 。)我们将此结果作为一个更一般的化简的推论得出:我们证明压缩后缀数组/树的所有参数(查询时间、空间、构建时间和构建工作空间)本质上都可以简化为一个数据结构的参数,该数据结构用于回答我们称为 和 的新查询类型。使用这些新技术,我们还开发了一种用于模式匹配的新索引。

相似文献

1
Breaking the -Barrier in the Construction of Compressed Suffix Arrays and Suffix Trees.
Proc Annu ACM SIAM Symp Discret Algorithms. 2023;2023:5122-5202. doi: 10.1137/1.9781611977554.ch187.
2
Compressed suffix tree--a basis for genome-scale sequence analysis.
Bioinformatics. 2007 Mar 1;23(5):629-30. doi: 10.1093/bioinformatics/btl681. Epub 2007 Jan 19.
3
Using the Sadakane compressed suffix tree to solve the all-pairs suffix-prefix problem.
Biomed Res Int. 2014;2014:745298. doi: 10.1155/2014/745298. Epub 2014 Apr 16.
4
An Elegant Algorithm for the Construction of Suffix Arrays.
J Discrete Algorithms (Amst). 2014 Jul 1;27:21-28. doi: 10.1016/j.jda.2014.03.001.
5
PFP Compressed Suffix Trees.
Proc Worksh Algorithm Eng Exp. 2021;2021:60-72. doi: 10.1137/1.9781611976472.5.
6
A space-efficient construction of the Burrows-Wheeler transform for genomic data.
J Comput Biol. 2005 Sep;12(7):943-51. doi: 10.1089/cmb.2005.12.943.
7
Relative Suffix Trees.
Comput J. 2018 May;61(5):773-788. doi: 10.1093/comjnl/bxx108. Epub 2017 Nov 21.
8
gsufsort: constructing suffix arrays, LCP arrays and BWTs for string collections.
Algorithms Mol Biol. 2020 Sep 22;15:18. doi: 10.1186/s13015-020-00177-y. eCollection 2020.
9
Suffix sorting via matching statistics.
Algorithms Mol Biol. 2024 Mar 12;19(1):11. doi: 10.1186/s13015-023-00245-z.
10
Graphical pan-genome analysis with compressed suffix trees and the Burrows-Wheeler transform.
Bioinformatics. 2016 Feb 15;32(4):497-504. doi: 10.1093/bioinformatics/btv603. Epub 2015 Oct 26.

本文引用的文献

1
PFP Compressed Suffix Trees.
Proc Worksh Algorithm Eng Exp. 2021;2021:60-72. doi: 10.1137/1.9781611976472.5.
2
SOAP2: an improved ultrafast tool for short read alignment.
Bioinformatics. 2009 Aug 1;25(15):1966-7. doi: 10.1093/bioinformatics/btp336. Epub 2009 Jun 3.
3
Fast and accurate short read alignment with Burrows-Wheeler transform.
Bioinformatics. 2009 Jul 15;25(14):1754-60. doi: 10.1093/bioinformatics/btp324. Epub 2009 May 18.
4
Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.
Genome Biol. 2009;10(3):R25. doi: 10.1186/gb-2009-10-3-r25. Epub 2009 Mar 4.

文献AI研究员

20分钟写一篇综述,助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型,支持多种主流文档格式。

立即体验