Suppr超能文献

外部内存中的广义增强后缀数组构造

Generalized enhanced suffix array construction in external memory.

作者信息

Louza Felipe A, Telles Guilherme P, Hoffmann Steve, Ciferri Cristina D A

机构信息

Department of Computing and Mathematics, University of São Paulo, Av. Bandeirantes, 3900, Ribeirão Preto, 14040-901 Brazil.

Institute of Computing, University of Campinas, Av. Albert Einstein, 1251, Campinas, 13083-852 Brazil.

出版信息

Algorithms Mol Biol. 2017 Dec 7;12:26. doi: 10.1186/s13015-017-0117-9. eCollection 2017.

Abstract

BACKGROUND

Suffix arrays, augmented by additional data structures, allow solving efficiently many string processing problems. The external memory construction of the generalized suffix array for a string collection is a fundamental task when the size of the input collection or the data structure exceeds the available internal memory.

RESULTS

In this article we present and analyze [Formula: see text] [introduced in CPM (External memory generalized suffix and [Formula: see text] arrays construction. In: Proceedings of CPM. pp 201-10, 2013)], the first external memory algorithm to construct generalized suffix arrays augmented with the longest common prefix array for a string collection. Our algorithm relies on a combination of buffers, induced sorting and a heap to avoid direct string comparisons. We performed experiments that covered different aspects of our algorithm, including running time, efficiency, external memory access, internal phases and the influence of different optimization strategies. On real datasets of size up to 24 GB and using 2 GB of internal memory, [Formula: see text] showed a competitive performance when compared to [Formula: see text] and [Formula: see text], which are efficient algorithms for a single string according to the related literature. We also show the effect of disk caching managed by the operating system on our algorithm.

CONCLUSIONS

The proposed algorithm was validated through performance tests using real datasets from different domains, in various combinations, and showed a competitive performance. Our algorithm can also construct the generalized Burrows-Wheeler transform of a string collection with no additional cost except by the output time.

摘要

背景

后缀数组通过附加的数据结构得到增强,能够高效地解决许多字符串处理问题。当输入集合或数据结构的大小超过可用的内部内存时,构建字符串集合的广义后缀数组的外部内存算法是一项基本任务。

结果

在本文中,我们提出并分析了[公式:见正文][在CPM(外部内存广义后缀和[公式:见正文]数组构建。见:CPM会议论文集。第201 - 10页,2013年)中引入],这是第一个用于构建为字符串集合增强了最长公共前缀数组的广义后缀数组的外部内存算法。我们的算法依靠缓冲区、诱导排序和堆的组合来避免直接的字符串比较。我们进行了涵盖算法不同方面的实验,包括运行时间、效率、外部内存访问、内部阶段以及不同优化策略的影响。在大小达24GB且使用2GB内部内存的真实数据集上,[公式:见正文]与[公式:见正文]和[公式:见正文]相比表现出了有竞争力的性能,根据相关文献,后两者是针对单个字符串的高效算法。我们还展示了操作系统管理的磁盘缓存对我们算法的影响。

结论

所提出的算法通过使用来自不同领域的真实数据集以各种组合进行性能测试得到了验证,并表现出有竞争力的性能。我们的算法还能够构建字符串集合的广义布罗特 - 惠勒变换,除了输出时间外无需额外成本。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0a76/5719966/a98edc15907a/13015_2017_117_Fig1_HTML.jpg

文献AI研究员

20分钟写一篇综述,助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型,支持多种主流文档格式。

立即体验