GPLSI Research Group, Department of Software and Computing Systems, University of Alicante, Alicante, Spain.
Lucentia Research Group, Department of Software and Computing Systems, University of Alicante, Alicante, Spain.
PLoS One. 2019 Apr 23;14(4):e0215288. doi: 10.1371/journal.pone.0215288. eCollection 2019.
The accessing and processing of textual information (i.e. the storing and querying of a set of strings) is especially important for many current applications (e.g. information retrieval and social networks), especially when working in the fields of Big Data or IoT, which require the handling of very large string dictionaries. Typical data structures for textual indexing are Hash Tables and some variants of Tries such as the Double Trie (DT). In this paper, we propose an extension of the DT that we have called MergedTrie. It improves the DT compression by merging both Tries into a single and by segmenting the indexed term into two fixed length parts in order to balance the new Trie. Thus, a higher overlapping of both prefixes and suffixes is obtained. Moreover, we propose a new implementation of Tries that achieves better compression rates than the Double-Array representation usually chosen for implementing Tries. Our proposal also overcomes the limitation of static implementations that does not allow insertions and updates in their compact representations. Finally, our MergedTrie implementation experimentally improves the efficiency of the Hash Tables, the DTs, the Double-Array, the Crit-bit, the Directed Acyclic Word Graphs (DAWG), and the Acyclic Deterministic Finite Automata (ADFA) data structures, requiring less space than the original text to be indexed.
文本信息的访问和处理(即一组字符串的存储和查询)对于许多当前的应用程序(例如信息检索和社交网络)非常重要,特别是在大数据或物联网领域工作时,需要处理非常大的字符串字典。文本索引的典型数据结构是哈希表和一些尝试的变体,例如双尝试(DT)。在本文中,我们提出了 DT 的扩展,我们称之为合并尝试(MergedTrie)。它通过将两个尝试合并到一个尝试中,并通过将索引项分割成两个固定长度的部分来平衡新的尝试,从而提高了 DT 的压缩率。因此,获得了更高的前缀和后缀重叠。此外,我们提出了一种新的尝试实现,比通常用于实现尝试的双数组表示实现了更好的压缩率。我们的提议还克服了静态实现的限制,该限制不允许在其紧凑表示中进行插入和更新。最后,我们的合并尝试实现实验性地提高了哈希表、DT、双数组、Crit-bit、有向无环字图(DAWG)和确定有限自动机(ADFA)数据结构的效率,需要的空间比要索引的原始文本少。