• 文献检索
  • 文档翻译
  • 深度研究
  • 学术资讯
  • Suppr Zotero 插件Zotero 插件
  • 邀请有礼
  • 套餐&价格
  • 历史记录
应用&插件
Suppr Zotero 插件Zotero 插件浏览器插件Mac 客户端Windows 客户端微信小程序
定价
高级版会员购买积分包购买API积分包
服务
文献检索文档翻译深度研究API 文档MCP 服务
关于我们
关于 Suppr公司介绍联系我们用户协议隐私条款
关注我们

Suppr 超能文献

核心技术专利:CN118964589B侵权必究
粤ICP备2023148730 号-1Suppr @ 2026

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验

RePair:通过减少内存使用来提高RePair的可扩展性。

RePair: Increasing the Scalability of RePair by Decreasing Memory Usage.

作者信息

Kim Justin, Varki Rahul, Oliva Marco, Boucher Christina

机构信息

Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida, 32607.

Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, 32607.

出版信息

bioRxiv. 2024 Jul 16:2024.07.11.603142. doi: 10.1101/2024.07.11.603142.

DOI:10.1101/2024.07.11.603142
PMID:39071359
原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC11275962/
Abstract

The RePair compression algorithm produces a context-free grammar by iteratively substituting the most frequently occurring pair of consecutive symbols with a new symbol until all consecutive pairs of symbols appear only once in the compressed text. It is widely used in the settings of bioinformatics, machine learning, and information retrieval where random access to the original input text is needed. For example, in pangenomics, RePair is used for random access to a population of genomes. BigRePair improves the scalability of the original RePair algorithm by using Prefix-Free Parsing (PFP) to preprocess the text prior to building the RePair grammar. Despite the efficiency of PFP on repetitive text, there is a scalability issue with the size of the parse which causes a memory bottleneck in BigRePair. In this paper, we design and implement recursive RePair (denoted as ), which builds the RePair grammar using recursive PFP. Our novel algorithm faces the challenge of constructing the RePair grammar without direct access to the parse of text, relying solely on the dictionary of the text and the parse and dictionary of the parse of the text. We compare to BigRePair using SARS-CoV-2 haplotypes and haplotypes from the 1000 Genomes Project. We show that our method achieves over a 40% peak memory reduction and a speed up ranging between 12% to 79% compared to BigRePair when compressing the largest input texts in all experiments. is made publicly available under the GNU public license here: https://github.com/jkim210/Recursive-RePair.

摘要

RePair压缩算法通过迭代地用一个新符号替换最频繁出现的连续符号对,直到所有连续符号对在压缩文本中只出现一次,从而生成一个上下文无关语法。它广泛应用于生物信息学、机器学习和信息检索等需要随机访问原始输入文本的场景。例如,在泛基因组学中,RePair用于随机访问一组基因组。BigRePair通过在构建RePair语法之前使用无前缀解析(PFP)对文本进行预处理,提高了原始RePair算法的可扩展性。尽管PFP在重复文本上效率很高,但解析大小存在可扩展性问题,这在BigRePair中导致了内存瓶颈。在本文中,我们设计并实现了递归RePair(表示为 ),它使用递归PFP构建RePair语法。我们的新算法面临着在不直接访问文本解析的情况下构建RePair语法的挑战,仅依赖于文本字典以及文本解析的解析和字典。我们使用SARS-CoV-2单倍型和千人基因组计划的单倍型将 与BigRePair进行比较。我们表明,在所有实验中压缩最大输入文本时,与BigRePair相比,我们的方法 实现了超过40%的峰值内存减少,加速比在12%到79%之间。 根据GNU公共许可证在此处公开提供:https://github.com/jkim210/Recursive-RePair。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9cba/11275962/b3b303b2d836/nihpp-2024.07.11.603142v1-f0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9cba/11275962/c45f4338ea80/nihpp-2024.07.11.603142v1-f0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9cba/11275962/8bf4f46d61d0/nihpp-2024.07.11.603142v1-f0005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9cba/11275962/b722d8d4382a/nihpp-2024.07.11.603142v1-f0006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9cba/11275962/4731dc481e77/nihpp-2024.07.11.603142v1-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9cba/11275962/ab72758392e3/nihpp-2024.07.11.603142v1-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9cba/11275962/b3b303b2d836/nihpp-2024.07.11.603142v1-f0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9cba/11275962/c45f4338ea80/nihpp-2024.07.11.603142v1-f0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9cba/11275962/8bf4f46d61d0/nihpp-2024.07.11.603142v1-f0005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9cba/11275962/b722d8d4382a/nihpp-2024.07.11.603142v1-f0006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9cba/11275962/4731dc481e77/nihpp-2024.07.11.603142v1-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9cba/11275962/ab72758392e3/nihpp-2024.07.11.603142v1-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9cba/11275962/b3b303b2d836/nihpp-2024.07.11.603142v1-f0003.jpg

相似文献

1
RePair: Increasing the Scalability of RePair by Decreasing Memory Usage.RePair:通过减少内存使用来提高RePair的可扩展性。
bioRxiv. 2024 Jul 16:2024.07.11.603142. doi: 10.1101/2024.07.11.603142.
2
Recursive Prefix-Free Parsing for Building Big BWTs.用于构建大型Burrows-Wheeler变换的递归无前缀解析
Proc Data Compress Conf. 2023 Mar;2023:62-70. Epub 2023 May 19.
3
Recursive Prefix-Free Parsing for Building Big BWTs.用于构建大型Burrows-Wheeler变换的递归无前缀解析
bioRxiv. 2023 Jan 20:2023.01.18.524557. doi: 10.1101/2023.01.18.524557.
4
Fast and Space-Efficient Construction of AVL Grammars from the LZ77 Parsing.基于LZ77解析的AVL语法的快速且空间高效构建。
Lebniz Int Proc Inform. 2021;204. doi: 10.4230/LIPIcs.ESA.2021.56.
5
Prefix-free parsing for building big BWTs.用于构建大型Burrows-Wheeler变换(BWT)的无前缀解析
Algorithms Mol Biol. 2019 May 24;14:13. doi: 10.1186/s13015-019-0148-5. eCollection 2019.
6
Computing the original eBWT faster, simpler, and with less memory.更快、更简单且占用更少内存地计算原始增强型Burrows-Wheeler变换。
Int Symp String Process Inf Retr. 2021 Oct;12944:129-142. doi: 10.1007/978-3-030-86692-1_11. Epub 2021 Sep 27.
7
Building a pangenome alignment index via recursive prefix-free parsing.通过递归无前缀解析构建泛基因组比对索引。
iScience. 2024 Sep 12;27(10):110933. doi: 10.1016/j.isci.2024.110933. eCollection 2024 Oct 18.
8
Pfp-fm: an accelerated FM-index.Pfp-fm:一种加速的FM索引。
Algorithms Mol Biol. 2024 Apr 10;19(1):15. doi: 10.1186/s13015-024-00260-8.
9
PFP-FM: An Accelerated FM-index.PFP-FM:一种加速的FM索引。
Res Sq. 2023 Oct 30:rs.3.rs-3487536. doi: 10.21203/rs.3.rs-3487536/v1.
10
PFP Compressed Suffix Trees.PFP压缩后缀树
Proc Worksh Algorithm Eng Exp. 2021;2021:60-72. doi: 10.1137/1.9781611976472.5.

本文引用的文献

1
Recursive Prefix-Free Parsing for Building Big BWTs.用于构建大型Burrows-Wheeler变换的递归无前缀解析
Proc Data Compress Conf. 2023 Mar;2023:62-70. Epub 2023 May 19.
2
The complete sequence of a human Y chromosome.人类 Y 染色体的完整序列。
Nature. 2023 Sep;621(7978):344-354. doi: 10.1038/s41586-023-06457-y. Epub 2023 Aug 23.
3
SPUMONI 2: improved classification using a pangenome index of minimizer digests.SPUMONI 2:使用最小化消化物的泛基因组指数进行改进分类。
Genome Biol. 2023 May 18;24(1):122. doi: 10.1186/s13059-023-02958-1.
4
High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios.对扩展的 1000 基因组项目队列进行高覆盖率全基因组测序,包括 602 个三核苷酸重复序列。
Cell. 2022 Sep 1;185(18):3426-3440.e19. doi: 10.1016/j.cell.2022.08.004.
5
MONI: A Pangenomic Index for Finding Maximal Exact Matches.MONI:用于寻找最大精确匹配的泛基因组索引。
J Comput Biol. 2022 Feb;29(2):169-187. doi: 10.1089/cmb.2021.0290. Epub 2022 Jan 17.
6
Sustainable data analysis with Snakemake.使用 Snakemake 进行可持续数据分析。
F1000Res. 2021 Jan 18;10:33. doi: 10.12688/f1000research.29032.2. eCollection 2021.
7
Prefix-free parsing for building big BWTs.用于构建大型Burrows-Wheeler变换(BWT)的无前缀解析
Algorithms Mol Biol. 2019 May 24;14:13. doi: 10.1186/s13015-019-0148-5. eCollection 2019.