• 文献检索
  • 文档翻译
  • 深度研究
  • 学术资讯
  • Suppr Zotero 插件Zotero 插件
  • 邀请有礼
  • 套餐&价格
  • 历史记录
应用&插件
Suppr Zotero 插件Zotero 插件浏览器插件Mac 客户端Windows 客户端微信小程序
定价
高级版会员购买积分包购买API积分包
服务
文献检索文档翻译深度研究API 文档MCP 服务
关于我们
关于 Suppr公司介绍联系我们用户协议隐私条款
关注我们

Suppr 超能文献

核心技术专利:CN118964589B侵权必究
粤ICP备2023148730 号-1Suppr @ 2026

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验

PFP压缩后缀树

PFP Compressed Suffix Trees.

作者信息

Boucher Christina, Cvacho Ondřej, Gagie Travis, Holub Jan, Manzini Giovanni, Navarro Gonzalo, Rossi Massimiliano

机构信息

Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida, USA.

Department of Theoretical Computer Science, Czech Technical University in Prague, Czech Republic.

出版信息

Proc Worksh Algorithm Eng Exp. 2021;2021:60-72. doi: 10.1137/1.9781611976472.5.

DOI:10.1137/1.9781611976472.5
PMID:35355938
原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC8963198/
Abstract

Prefix-free parsing (PFP) was introduced by Boucher et al. (2019) as a preprocessing step to ease the computation of Burrows-Wheeler Transforms (BWTs) of genomic databases. Given a string , it produces a dictionary and a parse of overlapping phrases such that BWT() can be computed from and in time and workspace bounded in terms of their combined size |PFP()|. In practice and are significantly smaller than and computing BWT() from them is more efficient than computing it from directly, at least when is the concatenation of many genomes. In this paper, we consider PFP() as a and show how it can be augmented to support still built and fitting within (|PFP()|) space. This entails the efficient computation of various primitives to simulate the suffix tree: computing a longest common extension (LCE) of two positions in ; reading any cell of its suffix array (SA), of its inverse (ISA), of its BWT, and of its longest common prefix array (LCP); and computing minima over ranges and next/previous smaller value queries over the LCP. Our experimental results show that the PFP suffix tree can be efficiently constructed for very large repetitive datasets and that its operations perform competitively with other compressed suffix trees that can only handle much smaller datasets.

摘要

无前缀解析(PFP)由布歇等人(2019年)提出,作为一种预处理步骤,以简化基因组数据库的Burrows-Wheeler变换(BWT)的计算。给定一个字符串,它会生成一个字典和一个重叠短语的解析,使得可以根据和在其组合大小|PFP()|所界定的时间和工作空间内计算出BWT()。在实践中,和明显小于,并且从它们计算BWT()比直接从计算更有效,至少当是许多基因组的串联时。在本文中,我们将PFP()视为一种,并展示如何对其进行扩展以支持仍然在(|PFP()|)空间内构建和适配。这需要高效计算各种原语来模拟后缀树:计算中两个位置的最长公共扩展(LCE);读取其后缀数组(SA)、其逆(ISA)、其BWT及其最长公共前缀数组(LCP)的任何单元格;以及计算范围内最小值和LCP上的下一个/上一个较小值查询。我们的实验结果表明,对于非常大的重复数据集,可以有效地构建PFP后缀树,并且其操作与只能处理小得多的数据集的其他压缩后缀树相比具有竞争力。

相似文献

1
PFP Compressed Suffix Trees.PFP压缩后缀树
Proc Worksh Algorithm Eng Exp. 2021;2021:60-72. doi: 10.1137/1.9781611976472.5.
2
Recursive Prefix-Free Parsing for Building Big BWTs.用于构建大型Burrows-Wheeler变换的递归无前缀解析
bioRxiv. 2023 Jan 20:2023.01.18.524557. doi: 10.1101/2023.01.18.524557.
3
Recursive Prefix-Free Parsing for Building Big BWTs.用于构建大型Burrows-Wheeler变换的递归无前缀解析
Proc Data Compress Conf. 2023 Mar;2023:62-70. Epub 2023 May 19.
4
Prefix-free parsing for building big BWTs.用于构建大型Burrows-Wheeler变换(BWT)的无前缀解析
Algorithms Mol Biol. 2019 May 24;14:13. doi: 10.1186/s13015-019-0148-5. eCollection 2019.
5
Breaking the -Barrier in the Construction of Compressed Suffix Arrays and Suffix Trees.突破压缩后缀数组和后缀树构建中的-障碍
Proc Annu ACM SIAM Symp Discret Algorithms. 2023;2023:5122-5202. doi: 10.1137/1.9781611977554.ch187.
6
External memory BWT and LCP computation for sequence collections with applications.用于序列集合的外部内存BWT和LCP计算及其应用
Algorithms Mol Biol. 2019 Mar 8;14:6. doi: 10.1186/s13015-019-0140-0. eCollection 2019.
7
gsufsort: constructing suffix arrays, LCP arrays and BWTs for string collections.gsufsort:为字符串集合构建后缀数组、最长公共前缀数组和Burrows-Wheeler变换
Algorithms Mol Biol. 2020 Sep 22;15:18. doi: 10.1186/s13015-020-00177-y. eCollection 2020.
8
Computing the original eBWT faster, simpler, and with less memory.更快、更简单且占用更少内存地计算原始增强型Burrows-Wheeler变换。
Int Symp String Process Inf Retr. 2021 Oct;12944:129-142. doi: 10.1007/978-3-030-86692-1_11. Epub 2021 Sep 27.
9
Computing matching statistics on Wheeler DFAs.计算惠勒确定有限自动机上的匹配统计信息。
Proc Data Compress Conf. 2023 Mar;2023:150-159. doi: 10.1109/dcc55655.2023.00023. Epub 2023 May 19.
10
r-indexing the eBWT.对增强型Burrows-Wheeler变换进行r索引
Int Symp String Process Inf Retr. 2021 Oct;12944:3-12. doi: 10.1007/978-3-030-86692-1_1. Epub 2021 Sep 27.

引用本文的文献

1
Breaking the -Barrier in the Construction of Compressed Suffix Arrays and Suffix Trees.突破压缩后缀数组和后缀树构建中的-障碍
Proc Annu ACM SIAM Symp Discret Algorithms. 2023;2023:5122-5202. doi: 10.1137/1.9781611977554.ch187.
2
An Upper Bound and Linear-Space Queries on the LZ-End Parsing.关于LZ-末端解析的一个上界和线性空间查询
Proc Annu ACM SIAM Symp Discret Algorithms. 2022;2022:2847-2866. doi: 10.1137/1.9781611977073.111.
3
CSTs for Terabyte-Sized Data.用于太字节级数据的CST
Proc Data Compress Conf. 2022 Mar;2022:93-102. doi: 10.1109/dcc52660.2022.00017. Epub 2022 Jul 4.
4
A survey of BWT variants for string collections.针对字符串集合的BWT变体调查。
Bioinformatics. 2024 May 24;40(7). doi: 10.1093/bioinformatics/btae333.
5
Computing the original eBWT faster, simpler, and with less memory.更快、更简单且占用更少内存地计算原始增强型Burrows-Wheeler变换。
Int Symp String Process Inf Retr. 2021 Oct;12944:129-142. doi: 10.1007/978-3-030-86692-1_11. Epub 2021 Sep 27.
6
Computational graph pangenomics: a tutorial on data structures and their applications.计算图泛基因组学:数据结构及其应用教程
Nat Comput. 2022 Mar;21(1):81-108. doi: 10.1007/s11047-022-09882-6. Epub 2022 Mar 4.
7
MONI: A Pangenomic Index for Finding Maximal Exact Matches.MONI:用于寻找最大精确匹配的泛基因组索引。
J Comput Biol. 2022 Feb;29(2):169-187. doi: 10.1089/cmb.2021.0290. Epub 2022 Jan 17.

本文引用的文献

1
Efficient Construction of a Complete Index for Pan-Genomics Read Alignment.高效构建全基因组读段比对的完整索引。
J Comput Biol. 2020 Apr;27(4):500-513. doi: 10.1089/cmb.2019.0309. Epub 2020 Mar 16.
2
Prefix-free parsing for building big BWTs.用于构建大型Burrows-Wheeler变换(BWT)的无前缀解析
Algorithms Mol Biol. 2019 May 24;14:13. doi: 10.1186/s13015-019-0148-5. eCollection 2019.
3
The Public Health Impact of a Publically Available, Environmental Database of Microbial Genomes.一个公开可用的微生物基因组环境数据库对公共卫生的影响。
Front Microbiol. 2017 May 9;8:808. doi: 10.3389/fmicb.2017.00808. eCollection 2017.
4
Computational pan-genomics: status, promises and challenges.计算泛基因组学:现状、前景与挑战。
Brief Bioinform. 2018 Jan 1;19(1):118-135. doi: 10.1093/bib/bbw089.
5
A global reference for human genetic variation.人类遗传变异的全球参考。
Nature. 2015 Oct 1;526(7571):68-74. doi: 10.1038/nature15393.
6
Storage and retrieval of highly repetitive sequence collections.高度重复序列集合的存储与检索。
J Comput Biol. 2010 Mar;17(3):281-308. doi: 10.1089/cmb.2009.0169.