Bonizzoni Paola, Boucher Christina, Cozzi Davide, Gagie Travis, Köppl Dominik, Rossi Massimiliano
University of Milano-Bicocca, Milano, Italy.
University of Florida, Gainesville, FL.
Int Symp String Process Inf Retr. 2023 Sep;14240:89-101. doi: 10.1007/978-3-031-43980-3_8. Epub 2023 Sep 20.
The positional Burrows-Wheeler Transform (PBWT) was presented as a means to find set-maximal exact matches (SMEMs) in haplotype data via the computation of the divergence array. Although run-length encoding the PBWT has been previously considered, storing the divergence array along with the PBWT in a compressed manner has not been as rigorously studied. We define two queries that can be used in combination to compute SMEMs, allowing us to define smaller data structures that support one or both of these queries. We combine these data structures, enabling the PBWT and the divergence array to be stored in a manner that allows for finding SMEMs. We estimate and compare the memory usage of these data structures, leading to one data structure that is most memory efficient. Lastly, we implement this data structure and compare its performance to prior methods using various datasets taken from the 1000 Genomes Project data.
位置布罗-惠勒变换(PBWT)被提出作为一种通过计算差异数组在单倍型数据中找到集合最大精确匹配(SMEM)的方法。尽管之前已经考虑过对PBWT进行游程编码,但以压缩方式存储差异数组和PBWT尚未得到如此严格的研究。我们定义了两个可以结合使用以计算SMEM的查询,这使我们能够定义支持其中一个或两个查询的更小的数据结构。我们将这些数据结构组合起来,使PBWT和差异数组能够以一种允许找到SMEM的方式存储。我们估计并比较这些数据结构的内存使用情况,得出一种内存效率最高的数据结构。最后,我们实现了这种数据结构,并将其性能与使用来自千人基因组计划数据的各种数据集的先前方法进行比较。