Brown Nathaniel K, Gagie Travis, Rossi Massimiliano
Faculty of Computer Science, Dalhousie University, NS, Canada.
Department of Computer and Information Science and Engineering, University of Florida, FL, USA.
Lebniz Int Proc Inform. 2022 Jul;233. Epub 2022 Jul 11.
Until recently, most experts would probably have agreed we cannot backwards-step in constant time with a run-length compressed Burrows-Wheeler Transform (RLBWT), since doing so relies on rank queries on sparse bitvectors and those inherit lower bounds from predecessor queries. At ICALP '21, however, Nishimoto and Tabei described a new, simple and constant-time implementation. For a permutation , it stores an -space table - where is the number of positions where either or - that enables the computation of successive values of by table look-ups and linear scans. Nishimoto and Tabei showed how to increase the number of rows in the table to bound the length of the linear scans such that the query time for computing is constant while maintaining -space. In this paper we refine Nishimoto and Tabei's approach, including a time-space tradeoff, and experimentally evaluate different implementations demonstrating the practicality of part of their result. We show that even without adding rows to the table, in practice we almost always scan only a few entries during queries. We propose a decomposition scheme of the permutation corresponding to the LF-mapping that allows an improved compression of the data structure, while limiting the query time. We tested our implementation on real-world genomic datasets and found that without compression of the table, backward-stepping is drastically faster than with sparse bitvector implementations but, unfortunately, also uses drastically more space. After compression, backward-stepping is competitive both in time and space with the best existing implementations.
直到最近,大多数专家可能都会认同,使用游程长度压缩的Burrows-Wheeler变换(RLBWT)无法在常数时间内进行反向步长操作,因为这样做依赖于对稀疏位向量的秩查询,而这些查询继承了前驱查询的下界。然而,在ICALP '21会议上,西本和多倍描述了一种新的、简单的常数时间实现方法。对于一个排列 ,它存储一个 空间的表——其中 是满足 或 的位置 的数量——这使得通过查表和线性扫描来计算 的连续值成为可能。西本和多倍展示了如何增加表中的行数来限制线性扫描的长度,从而使得计算 的查询时间为常数,同时保持 空间。在本文中,我们改进了西本和多倍的方法,包括时间-空间权衡,并通过实验评估了不同的实现方法,证明了他们部分结果的实用性。我们表明,即使不向表中添加行,在实际操作中,我们在查询期间几乎总是只扫描少数几个条目。我们提出了一种与LF映射相对应的排列 的分解方案,该方案允许改进数据结构的压缩,同时限制查询时间。我们在真实世界的基因组数据集上测试了我们的实现方法,发现不压缩表时,反向步长操作比使用稀疏位向量实现方法要快得多,但不幸的是,所使用的空间也大得多。压缩后,反向步长操作在时间和空间上与现有的最佳实现方法具有竞争力。