• 文献检索
  • 文档翻译
  • 深度研究
  • 学术资讯
  • Suppr Zotero 插件Zotero 插件
  • 邀请有礼
  • 套餐&价格
  • 历史记录
应用&插件
Suppr Zotero 插件Zotero 插件浏览器插件Mac 客户端Windows 客户端微信小程序
定价
高级版会员购买积分包购买API积分包
服务
文献检索文档翻译深度研究API 文档MCP 服务
关于我们
关于 Suppr公司介绍联系我们用户协议隐私条款
关注我们

Suppr 超能文献

核心技术专利:CN118964589B侵权必究
粤ICP备2023148730 号-1Suppr @ 2026

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验

合并 Trie:高效的文本索引。

MergedTrie: Efficient textual indexing.

机构信息

GPLSI Research Group, Department of Software and Computing Systems, University of Alicante, Alicante, Spain.

Lucentia Research Group, Department of Software and Computing Systems, University of Alicante, Alicante, Spain.

出版信息

PLoS One. 2019 Apr 23;14(4):e0215288. doi: 10.1371/journal.pone.0215288. eCollection 2019.

DOI:10.1371/journal.pone.0215288
PMID:31013282
原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC6478299/
Abstract

The accessing and processing of textual information (i.e. the storing and querying of a set of strings) is especially important for many current applications (e.g. information retrieval and social networks), especially when working in the fields of Big Data or IoT, which require the handling of very large string dictionaries. Typical data structures for textual indexing are Hash Tables and some variants of Tries such as the Double Trie (DT). In this paper, we propose an extension of the DT that we have called MergedTrie. It improves the DT compression by merging both Tries into a single and by segmenting the indexed term into two fixed length parts in order to balance the new Trie. Thus, a higher overlapping of both prefixes and suffixes is obtained. Moreover, we propose a new implementation of Tries that achieves better compression rates than the Double-Array representation usually chosen for implementing Tries. Our proposal also overcomes the limitation of static implementations that does not allow insertions and updates in their compact representations. Finally, our MergedTrie implementation experimentally improves the efficiency of the Hash Tables, the DTs, the Double-Array, the Crit-bit, the Directed Acyclic Word Graphs (DAWG), and the Acyclic Deterministic Finite Automata (ADFA) data structures, requiring less space than the original text to be indexed.

摘要

文本信息的访问和处理(即一组字符串的存储和查询)对于许多当前的应用程序(例如信息检索和社交网络)非常重要,特别是在大数据或物联网领域工作时,需要处理非常大的字符串字典。文本索引的典型数据结构是哈希表和一些尝试的变体,例如双尝试(DT)。在本文中,我们提出了 DT 的扩展,我们称之为合并尝试(MergedTrie)。它通过将两个尝试合并到一个尝试中,并通过将索引项分割成两个固定长度的部分来平衡新的尝试,从而提高了 DT 的压缩率。因此,获得了更高的前缀和后缀重叠。此外,我们提出了一种新的尝试实现,比通常用于实现尝试的双数组表示实现了更好的压缩率。我们的提议还克服了静态实现的限制,该限制不允许在其紧凑表示中进行插入和更新。最后,我们的合并尝试实现实验性地提高了哈希表、DT、双数组、Crit-bit、有向无环字图(DAWG)和确定有限自动机(ADFA)数据结构的效率,需要的空间比要索引的原始文本少。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8b27/6478299/05adc3f2d791/pone.0215288.g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8b27/6478299/dc0b09a45f3e/pone.0215288.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8b27/6478299/5cbf1899afbd/pone.0215288.g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8b27/6478299/05adc3f2d791/pone.0215288.g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8b27/6478299/dc0b09a45f3e/pone.0215288.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8b27/6478299/5cbf1899afbd/pone.0215288.g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8b27/6478299/05adc3f2d791/pone.0215288.g003.jpg

相似文献

1
MergedTrie: Efficient textual indexing.合并 Trie:高效的文本索引。
PLoS One. 2019 Apr 23;14(4):e0215288. doi: 10.1371/journal.pone.0215288. eCollection 2019.
2
Data structure set-trie for storing and querying sets: Theoretical and empirical analysis.用于存储和查询集合的数据结构集合前缀树:理论与实证分析。
PLoS One. 2021 Feb 10;16(2):e0245122. doi: 10.1371/journal.pone.0245122. eCollection 2021.
3
Recognizing names in biomedical texts: a machine learning approach.识别生物医学文本中的名称:一种机器学习方法。
Bioinformatics. 2004 May 1;20(7):1178-90. doi: 10.1093/bioinformatics/bth060. Epub 2004 Feb 10.
4
SecDedoop: Secure Deduplication with Access Control of Big Data in the HDFS/Hadoop Environment.SecDedoop:HDFS/Hadoop 环境中具有大数据访问控制的安全去重。
Big Data. 2020 Apr;8(2):147-163. doi: 10.1089/big.2019.0120.
5
A novel biomedical image indexing and retrieval system via deep preference learning.一种基于深度偏好学习的新型生物医学图像索引和检索系统。
Comput Methods Programs Biomed. 2018 May;158:53-69. doi: 10.1016/j.cmpb.2018.02.003. Epub 2018 Feb 6.
6
Efficient protein structure search using indexing methods.利用索引方法进行高效的蛋白质结构搜索。
BMC Med Inform Decis Mak. 2013;13 Suppl 1(Suppl 1):S8. doi: 10.1186/1472-6947-13-s1-s8. Epub 2013 Apr 5.
7
Font adaptive word indexing of modern printed documents.现代印刷文档的字体自适应词索引
IEEE Trans Pattern Anal Mach Intell. 2006 Aug;28(8):1187-99. doi: 10.1109/TPAMI.2006.162.
8
GRAPES-DD: exploiting decision diagrams for index-driven search in biological graph databases.GRAPES-DD:利用决策图进行生物图谱数据库中的索引驱动搜索。
BMC Bioinformatics. 2021 Apr 22;22(1):209. doi: 10.1186/s12859-021-04129-0.
9
Bitpacking techniques for indexing genomes: I. Hash tables.用于基因组索引的位包装技术:I. 哈希表
Algorithms Mol Biol. 2016 Apr 18;11:5. doi: 10.1186/s13015-016-0069-5. eCollection 2016.
10
Modeling semantic aspects for cross-media image indexing.跨媒体图像索引的语义方面建模
IEEE Trans Pattern Anal Mach Intell. 2007 Oct;29(10):1802-17. doi: 10.1109/TPAMI.2007.1097.

引用本文的文献

1
Correction: MergedTrie: Efficient textual indexing.更正:合并前缀树:高效文本索引。
PLoS One. 2019 May 31;14(5):e0217958. doi: 10.1371/journal.pone.0217958. eCollection 2019.

本文引用的文献

1
Internet of Things: A Review of Surveys Based on Context Aware Intelligent Services.物联网:基于上下文感知智能服务的综述
Sensors (Basel). 2016 Jul 11;16(7):1069. doi: 10.3390/s16071069.
2
B-HIT - A Tool for Harvesting and Indexing Biodiversity Data.B-HIT - 一种用于收集和索引生物多样性数据的工具。
PLoS One. 2015 Nov 6;10(11):e0142240. doi: 10.1371/journal.pone.0142240. eCollection 2015.
3
Text mining for literature review and knowledge discovery in cancer risk assessment and research.文本挖掘在癌症风险评估和研究中的文献综述和知识发现。
PLoS One. 2012;7(4):e33427. doi: 10.1371/journal.pone.0033427. Epub 2012 Apr 12.