Kim Justin, Varki Rahul, Oliva Marco, Boucher Christina
Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida, 32607.
Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, 32607.
bioRxiv. 2024 Jul 16:2024.07.11.603142. doi: 10.1101/2024.07.11.603142.
The RePair compression algorithm produces a context-free grammar by iteratively substituting the most frequently occurring pair of consecutive symbols with a new symbol until all consecutive pairs of symbols appear only once in the compressed text. It is widely used in the settings of bioinformatics, machine learning, and information retrieval where random access to the original input text is needed. For example, in pangenomics, RePair is used for random access to a population of genomes. BigRePair improves the scalability of the original RePair algorithm by using Prefix-Free Parsing (PFP) to preprocess the text prior to building the RePair grammar. Despite the efficiency of PFP on repetitive text, there is a scalability issue with the size of the parse which causes a memory bottleneck in BigRePair. In this paper, we design and implement recursive RePair (denoted as ), which builds the RePair grammar using recursive PFP. Our novel algorithm faces the challenge of constructing the RePair grammar without direct access to the parse of text, relying solely on the dictionary of the text and the parse and dictionary of the parse of the text. We compare to BigRePair using SARS-CoV-2 haplotypes and haplotypes from the 1000 Genomes Project. We show that our method achieves over a 40% peak memory reduction and a speed up ranging between 12% to 79% compared to BigRePair when compressing the largest input texts in all experiments. is made publicly available under the GNU public license here: https://github.com/jkim210/Recursive-RePair.
RePair压缩算法通过迭代地用一个新符号替换最频繁出现的连续符号对,直到所有连续符号对在压缩文本中只出现一次,从而生成一个上下文无关语法。它广泛应用于生物信息学、机器学习和信息检索等需要随机访问原始输入文本的场景。例如,在泛基因组学中,RePair用于随机访问一组基因组。BigRePair通过在构建RePair语法之前使用无前缀解析(PFP)对文本进行预处理,提高了原始RePair算法的可扩展性。尽管PFP在重复文本上效率很高,但解析大小存在可扩展性问题,这在BigRePair中导致了内存瓶颈。在本文中,我们设计并实现了递归RePair(表示为 ),它使用递归PFP构建RePair语法。我们的新算法面临着在不直接访问文本解析的情况下构建RePair语法的挑战,仅依赖于文本字典以及文本解析的解析和字典。我们使用SARS-CoV-2单倍型和千人基因组计划的单倍型将 与BigRePair进行比较。我们表明,在所有实验中压缩最大输入文本时,与BigRePair相比,我们的方法 实现了超过40%的峰值内存减少,加速比在12%到79%之间。 根据GNU公共许可证在此处公开提供:https://github.com/jkim210/Recursive-RePair。