Boucher Christina, Cvacho Ondřej, Gagie Travis, Holub Jan, Manzini Giovanni, Navarro Gonzalo, Rossi Massimiliano
Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida, USA.
Department of Theoretical Computer Science, Czech Technical University in Prague, Czech Republic.
Proc Worksh Algorithm Eng Exp. 2021;2021:60-72. doi: 10.1137/1.9781611976472.5.
Prefix-free parsing (PFP) was introduced by Boucher et al. (2019) as a preprocessing step to ease the computation of Burrows-Wheeler Transforms (BWTs) of genomic databases. Given a string , it produces a dictionary and a parse of overlapping phrases such that BWT() can be computed from and in time and workspace bounded in terms of their combined size |PFP()|. In practice and are significantly smaller than and computing BWT() from them is more efficient than computing it from directly, at least when is the concatenation of many genomes. In this paper, we consider PFP() as a and show how it can be augmented to support still built and fitting within (|PFP()|) space. This entails the efficient computation of various primitives to simulate the suffix tree: computing a longest common extension (LCE) of two positions in ; reading any cell of its suffix array (SA), of its inverse (ISA), of its BWT, and of its longest common prefix array (LCP); and computing minima over ranges and next/previous smaller value queries over the LCP. Our experimental results show that the PFP suffix tree can be efficiently constructed for very large repetitive datasets and that its operations perform competitively with other compressed suffix trees that can only handle much smaller datasets.
无前缀解析(PFP)由布歇等人(2019年)提出,作为一种预处理步骤,以简化基因组数据库的Burrows-Wheeler变换(BWT)的计算。给定一个字符串,它会生成一个字典和一个重叠短语的解析,使得可以根据和在其组合大小|PFP()|所界定的时间和工作空间内计算出BWT()。在实践中,和明显小于,并且从它们计算BWT()比直接从计算更有效,至少当是许多基因组的串联时。在本文中,我们将PFP()视为一种,并展示如何对其进行扩展以支持仍然在(|PFP()|)空间内构建和适配。这需要高效计算各种原语来模拟后缀树:计算中两个位置的最长公共扩展(LCE);读取其后缀数组(SA)、其逆(ISA)、其BWT及其最长公共前缀数组(LCP)的任何单元格;以及计算范围内最小值和LCP上的下一个/上一个较小值查询。我们的实验结果表明,对于非常大的重复数据集,可以有效地构建PFP后缀树,并且其操作与只能处理小得多的数据集的其他压缩后缀树相比具有竞争力。