Suppr超能文献

PFP压缩后缀树

PFP Compressed Suffix Trees.

作者信息

Boucher Christina, Cvacho Ondřej, Gagie Travis, Holub Jan, Manzini Giovanni, Navarro Gonzalo, Rossi Massimiliano

机构信息

Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida, USA.

Department of Theoretical Computer Science, Czech Technical University in Prague, Czech Republic.

出版信息

Proc Worksh Algorithm Eng Exp. 2021;2021:60-72. doi: 10.1137/1.9781611976472.5.

Abstract

Prefix-free parsing (PFP) was introduced by Boucher et al. (2019) as a preprocessing step to ease the computation of Burrows-Wheeler Transforms (BWTs) of genomic databases. Given a string , it produces a dictionary and a parse of overlapping phrases such that BWT() can be computed from and in time and workspace bounded in terms of their combined size |PFP()|. In practice and are significantly smaller than and computing BWT() from them is more efficient than computing it from directly, at least when is the concatenation of many genomes. In this paper, we consider PFP() as a and show how it can be augmented to support still built and fitting within (|PFP()|) space. This entails the efficient computation of various primitives to simulate the suffix tree: computing a longest common extension (LCE) of two positions in ; reading any cell of its suffix array (SA), of its inverse (ISA), of its BWT, and of its longest common prefix array (LCP); and computing minima over ranges and next/previous smaller value queries over the LCP. Our experimental results show that the PFP suffix tree can be efficiently constructed for very large repetitive datasets and that its operations perform competitively with other compressed suffix trees that can only handle much smaller datasets.

摘要

无前缀解析(PFP)由布歇等人(2019年)提出,作为一种预处理步骤,以简化基因组数据库的Burrows-Wheeler变换(BWT)的计算。给定一个字符串,它会生成一个字典和一个重叠短语的解析,使得可以根据和在其组合大小|PFP()|所界定的时间和工作空间内计算出BWT()。在实践中,和明显小于,并且从它们计算BWT()比直接从计算更有效,至少当是许多基因组的串联时。在本文中,我们将PFP()视为一种,并展示如何对其进行扩展以支持仍然在(|PFP()|)空间内构建和适配。这需要高效计算各种原语来模拟后缀树:计算中两个位置的最长公共扩展(LCE);读取其后缀数组(SA)、其逆(ISA)、其BWT及其最长公共前缀数组(LCP)的任何单元格;以及计算范围内最小值和LCP上的下一个/上一个较小值查询。我们的实验结果表明,对于非常大的重复数据集,可以有效地构建PFP后缀树,并且其操作与只能处理小得多的数据集的其他压缩后缀树相比具有竞争力。

相似文献

1
PFP Compressed Suffix Trees.
Proc Worksh Algorithm Eng Exp. 2021;2021:60-72. doi: 10.1137/1.9781611976472.5.
2
Recursive Prefix-Free Parsing for Building Big BWTs.
bioRxiv. 2023 Jan 20:2023.01.18.524557. doi: 10.1101/2023.01.18.524557.
3
Recursive Prefix-Free Parsing for Building Big BWTs.
Proc Data Compress Conf. 2023 Mar;2023:62-70. Epub 2023 May 19.
4
Prefix-free parsing for building big BWTs.
Algorithms Mol Biol. 2019 May 24;14:13. doi: 10.1186/s13015-019-0148-5. eCollection 2019.
5
Breaking the -Barrier in the Construction of Compressed Suffix Arrays and Suffix Trees.
Proc Annu ACM SIAM Symp Discret Algorithms. 2023;2023:5122-5202. doi: 10.1137/1.9781611977554.ch187.
6
External memory BWT and LCP computation for sequence collections with applications.
Algorithms Mol Biol. 2019 Mar 8;14:6. doi: 10.1186/s13015-019-0140-0. eCollection 2019.
7
gsufsort: constructing suffix arrays, LCP arrays and BWTs for string collections.
Algorithms Mol Biol. 2020 Sep 22;15:18. doi: 10.1186/s13015-020-00177-y. eCollection 2020.
8
Computing the original eBWT faster, simpler, and with less memory.
Int Symp String Process Inf Retr. 2021 Oct;12944:129-142. doi: 10.1007/978-3-030-86692-1_11. Epub 2021 Sep 27.
9
Computing matching statistics on Wheeler DFAs.
Proc Data Compress Conf. 2023 Mar;2023:150-159. doi: 10.1109/dcc55655.2023.00023. Epub 2023 May 19.
10
r-indexing the eBWT.
Int Symp String Process Inf Retr. 2021 Oct;12944:3-12. doi: 10.1007/978-3-030-86692-1_1. Epub 2021 Sep 27.

引用本文的文献

1
Breaking the -Barrier in the Construction of Compressed Suffix Arrays and Suffix Trees.
Proc Annu ACM SIAM Symp Discret Algorithms. 2023;2023:5122-5202. doi: 10.1137/1.9781611977554.ch187.
2
An Upper Bound and Linear-Space Queries on the LZ-End Parsing.
Proc Annu ACM SIAM Symp Discret Algorithms. 2022;2022:2847-2866. doi: 10.1137/1.9781611977073.111.
3
CSTs for Terabyte-Sized Data.
Proc Data Compress Conf. 2022 Mar;2022:93-102. doi: 10.1109/dcc52660.2022.00017. Epub 2022 Jul 4.
4
A survey of BWT variants for string collections.
Bioinformatics. 2024 May 24;40(7). doi: 10.1093/bioinformatics/btae333.
5
Computing the original eBWT faster, simpler, and with less memory.
Int Symp String Process Inf Retr. 2021 Oct;12944:129-142. doi: 10.1007/978-3-030-86692-1_11. Epub 2021 Sep 27.
6
Computational graph pangenomics: a tutorial on data structures and their applications.
Nat Comput. 2022 Mar;21(1):81-108. doi: 10.1007/s11047-022-09882-6. Epub 2022 Mar 4.
7
MONI: A Pangenomic Index for Finding Maximal Exact Matches.
J Comput Biol. 2022 Feb;29(2):169-187. doi: 10.1089/cmb.2021.0290. Epub 2022 Jan 17.

本文引用的文献

1
Efficient Construction of a Complete Index for Pan-Genomics Read Alignment.
J Comput Biol. 2020 Apr;27(4):500-513. doi: 10.1089/cmb.2019.0309. Epub 2020 Mar 16.
2
Prefix-free parsing for building big BWTs.
Algorithms Mol Biol. 2019 May 24;14:13. doi: 10.1186/s13015-019-0148-5. eCollection 2019.
3
The Public Health Impact of a Publically Available, Environmental Database of Microbial Genomes.
Front Microbiol. 2017 May 9;8:808. doi: 10.3389/fmicb.2017.00808. eCollection 2017.
4
Computational pan-genomics: status, promises and challenges.
Brief Bioinform. 2018 Jan 1;19(1):118-135. doi: 10.1093/bib/bbw089.
5
A global reference for human genetic variation.
Nature. 2015 Oct 1;526(7571):68-74. doi: 10.1038/nature15393.
6
Storage and retrieval of highly repetitive sequence collections.
J Comput Biol. 2010 Mar;17(3):281-308. doi: 10.1089/cmb.2009.0169.

文献AI研究员

20分钟写一篇综述,助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型,支持多种主流文档格式。

立即体验