外部内存中的广义增强后缀数组构造

Generalized enhanced suffix array construction in external memory.

作者信息

Louza Felipe A, Telles Guilherme P, Hoffmann Steve, Ciferri Cristina D A

机构信息

Department of Computing and Mathematics, University of São Paulo, Av. Bandeirantes, 3900, Ribeirão Preto, 14040-901 Brazil.

Institute of Computing, University of Campinas, Av. Albert Einstein, 1251, Campinas, 13083-852 Brazil.

出版信息

Algorithms Mol Biol. 2017 Dec 7;12:26. doi: 10.1186/s13015-017-0117-9. eCollection 2017.

DOI:10.1186/s13015-017-0117-9

PMID:29234460

原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC5719966/

Abstract

BACKGROUND

Suffix arrays, augmented by additional data structures, allow solving efficiently many string processing problems. The external memory construction of the generalized suffix array for a string collection is a fundamental task when the size of the input collection or the data structure exceeds the available internal memory.

RESULTS

In this article we present and analyze [Formula: see text] [introduced in CPM (External memory generalized suffix and [Formula: see text] arrays construction. In: Proceedings of CPM. pp 201-10, 2013)], the first external memory algorithm to construct generalized suffix arrays augmented with the longest common prefix array for a string collection. Our algorithm relies on a combination of buffers, induced sorting and a heap to avoid direct string comparisons. We performed experiments that covered different aspects of our algorithm, including running time, efficiency, external memory access, internal phases and the influence of different optimization strategies. On real datasets of size up to 24 GB and using 2 GB of internal memory, [Formula: see text] showed a competitive performance when compared to [Formula: see text] and [Formula: see text], which are efficient algorithms for a single string according to the related literature. We also show the effect of disk caching managed by the operating system on our algorithm.

CONCLUSIONS

The proposed algorithm was validated through performance tests using real datasets from different domains, in various combinations, and showed a competitive performance. Our algorithm can also construct the generalized Burrows-Wheeler transform of a string collection with no additional cost except by the output time.

摘要

背景

后缀数组通过附加的数据结构得到增强，能够高效地解决许多字符串处理问题。当输入集合或数据结构的大小超过可用的内部内存时，构建字符串集合的广义后缀数组的外部内存算法是一项基本任务。

结果

在本文中，我们提出并分析了[公式：见正文][在CPM（外部内存广义后缀和[公式：见正文]数组构建。见：CPM会议论文集。第201 - 10页，2013年）中引入]，这是第一个用于构建为字符串集合增强了最长公共前缀数组的广义后缀数组的外部内存算法。我们的算法依靠缓冲区、诱导排序和堆的组合来避免直接的字符串比较。我们进行了涵盖算法不同方面的实验，包括运行时间、效率、外部内存访问、内部阶段以及不同优化策略的影响。在大小达24GB且使用2GB内部内存的真实数据集上，[公式：见正文]与[公式：见正文]和[公式：见正文]相比表现出了有竞争力的性能，根据相关文献，后两者是针对单个字符串的高效算法。我们还展示了操作系统管理的磁盘缓存对我们算法的影响。

结论

所提出的算法通过使用来自不同领域的真实数据集以各种组合进行性能测试得到了验证，并表现出有竞争力的性能。我们的算法还能够构建字符串集合的广义布罗特 - 惠勒变换，除了输出时间外无需额外成本。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0a76/5719966/a98edc15907a/13015_2017_117_Fig1_HTML.jpg

相似文献

Generalized enhanced suffix array construction in external memory.

Algorithms Mol Biol. 2017 Dec 7;12:26. doi: 10.1186/s13015-017-0117-9. eCollection 2017.

gsufsort: constructing suffix arrays, LCP arrays and BWTs for string collections.

Algorithms Mol Biol. 2020 Sep 22;15:18. doi: 10.1186/s13015-020-00177-y. eCollection 2020.

External memory BWT and LCP computation for sequence collections with applications.

Algorithms Mol Biol. 2019 Mar 8;14:6. doi: 10.1186/s13015-019-0140-0. eCollection 2019.

An Elegant Algorithm for the Construction of Suffix Arrays.

J Discrete Algorithms (Amst). 2014 Jul 1;27:21-28. doi: 10.1016/j.jda.2014.03.001.

Suffix sorting via matching statistics.

Algorithms Mol Biol. 2024 Mar 12;19(1):11. doi: 10.1186/s13015-023-00245-z.

Computing the original eBWT faster, simpler, and with less memory.

Int Symp String Process Inf Retr. 2021 Oct;12944:129-142. doi: 10.1007/978-3-030-86692-1_11. Epub 2021 Sep 27.

Linear-time computation of minimal absent words using suffix array.

BMC Bioinformatics. 2014 Dec 20;15(1):388. doi: 10.1186/s12859-014-0388-9.

Multithread Multistring Burrows-Wheeler Transform and Longest Common Prefix Array.

J Comput Biol. 2019 Sep;26(9):948-961. doi: 10.1089/cmb.2018.0230. Epub 2019 May 29.

Encoded expansion: an efficient algorithm to discover identical string motifs.

PLoS One. 2014 May 28;9(5):e95148. doi: 10.1371/journal.pone.0095148. eCollection 2014.

PFP Compressed Suffix Trees.

Proc Worksh Algorithm Eng Exp. 2021;2021:60-72. doi: 10.1137/1.9781611976472.5.

引用本文的文献

A survey of BWT variants for string collections.

Bioinformatics. 2024 May 24;40(7). doi: 10.1093/bioinformatics/btae333.

Computing the original eBWT faster, simpler, and with less memory.

Int Symp String Process Inf Retr. 2021 Oct;12944:129-142. doi: 10.1007/978-3-030-86692-1_11. Epub 2021 Sep 27.

Metagenomic analysis through the extended Burrows-Wheeler transform.

BMC Bioinformatics. 2020 Sep 16;21(Suppl 8):299. doi: 10.1186/s12859-020-03628-w.

External memory BWT and LCP computation for sequence collections with applications.

Algorithms Mol Biol. 2019 Mar 8;14:6. doi: 10.1186/s13015-019-0140-0. eCollection 2019.

SNPs detection by eBWT positional clustering.

Algorithms Mol Biol. 2019 Feb 6;14:3. doi: 10.1186/s13015-019-0137-8. eCollection 2019.

文献AI研究员

20分钟写一篇综述，助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型，支持多种主流文档格式。

立即体验

外部内存中的广义增强后缀数组构造

Generalized enhanced suffix array construction in external memory.

作者信息

机构信息

出版信息

BACKGROUND

RESULTS

CONCLUSIONS

背景

结果

结论

相似文献

引用本文的文献

文献AI研究员

用中文搜PubMed

文档翻译

Suppr 超能文献

相似文献

引用本文的文献