• 文献检索
  • 文档翻译
  • 深度研究
  • 学术资讯
  • Suppr Zotero 插件Zotero 插件
  • 邀请有礼
  • 套餐&价格
  • 历史记录
应用&插件
Suppr Zotero 插件Zotero 插件浏览器插件Mac 客户端Windows 客户端微信小程序
定价
高级版会员购买积分包购买API积分包
服务
文献检索文档翻译深度研究API 文档MCP 服务
关于我们
关于 Suppr公司介绍联系我们用户协议隐私条款
关注我们

Suppr 超能文献

核心技术专利:CN118964589B侵权必究
粤ICP备2023148730 号-1Suppr @ 2026

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验

Document retrieval on repetitive string collections.

作者信息

Gagie Travis, Hartikainen Aleksi, Karhu Kalle, Kärkkäinen Juha, Navarro Gonzalo, Puglisi Simon J, Sirén Jouni

机构信息

CeBiB - Center of Biotechnology and Bioengineering, School of Computer Science and Telecommunications, Diego Portales University, Santiago, Chile.

Google Inc, Mountain View, CA USA.

出版信息

Inf Retr Boston. 2017;20(3):253-291. doi: 10.1007/s10791-017-9297-7. Epub 2017 Apr 1.

DOI:10.1007/s10791-017-9297-7
PMID:28596702
原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC5445192/
Abstract

Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude. We study the problem of indexing repetitive string collections in order to perform efficient document retrieval operations on them. Document retrieval problems are routinely solved by search engines on large natural language collections, but the techniques are less developed on generic string collections. The case of repetitive string collections is even less understood, and there are very few existing solutions. We develop two novel ideas, and , that yield highly compressed indexes solving the problem of document listing (find all the documents where a string appears), top- document retrieval (find the documents where a string appears most often), and document counting (count the number of documents where a string appears). We also show that a classical data structure supporting the latter query becomes highly compressible on repetitive data. Finally, we show how the tools we developed can be combined to solve ranked conjunctive and disjunctive multi-term queries under the simple [Formula: see text] model of relevance. We thoroughly evaluate the resulting techniques in various real-life repetitiveness scenarios, and recommend the best choices for each case.

摘要
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/13a4/5445192/1e96b84ff7c8/10791_2017_9297_Fig11_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/13a4/5445192/ff7239258014/10791_2017_9297_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/13a4/5445192/2a8260f8480e/10791_2017_9297_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/13a4/5445192/0a938a687195/10791_2017_9297_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/13a4/5445192/46aba512ac79/10791_2017_9297_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/13a4/5445192/3179b489b634/10791_2017_9297_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/13a4/5445192/e183eac38cab/10791_2017_9297_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/13a4/5445192/b2e2ac752072/10791_2017_9297_Fig7_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/13a4/5445192/4690438790e1/10791_2017_9297_Fig8_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/13a4/5445192/a597b7d3f6f5/10791_2017_9297_Fig9_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/13a4/5445192/9c75de61b1dd/10791_2017_9297_Fig10_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/13a4/5445192/1e96b84ff7c8/10791_2017_9297_Fig11_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/13a4/5445192/ff7239258014/10791_2017_9297_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/13a4/5445192/2a8260f8480e/10791_2017_9297_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/13a4/5445192/0a938a687195/10791_2017_9297_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/13a4/5445192/46aba512ac79/10791_2017_9297_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/13a4/5445192/3179b489b634/10791_2017_9297_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/13a4/5445192/e183eac38cab/10791_2017_9297_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/13a4/5445192/b2e2ac752072/10791_2017_9297_Fig7_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/13a4/5445192/4690438790e1/10791_2017_9297_Fig8_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/13a4/5445192/a597b7d3f6f5/10791_2017_9297_Fig9_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/13a4/5445192/9c75de61b1dd/10791_2017_9297_Fig10_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/13a4/5445192/1e96b84ff7c8/10791_2017_9297_Fig11_HTML.jpg

相似文献

1
Document retrieval on repetitive string collections.
Inf Retr Boston. 2017;20(3):253-291. doi: 10.1007/s10791-017-9297-7. Epub 2017 Apr 1.
2
gsufsort: constructing suffix arrays, LCP arrays and BWTs for string collections.gsufsort:为字符串集合构建后缀数组、最长公共前缀数组和Burrows-Wheeler变换
Algorithms Mol Biol. 2020 Sep 22;15:18. doi: 10.1186/s13015-020-00177-y. eCollection 2020.
3
Storage and retrieval of highly repetitive sequence collections.高度重复序列集合的存储与检索。
J Comput Biol. 2010 Mar;17(3):281-308. doi: 10.1089/cmb.2009.0169.
4
Relative Suffix Trees.
Comput J. 2018 May;61(5):773-788. doi: 10.1093/comjnl/bxx108. Epub 2017 Nov 21.
5
Font adaptive word indexing of modern printed documents.现代印刷文档的字体自适应词索引
IEEE Trans Pattern Anal Mach Intell. 2006 Aug;28(8):1187-99. doi: 10.1109/TPAMI.2006.162.
6
Breaking the -Barrier in the Construction of Compressed Suffix Arrays and Suffix Trees.突破压缩后缀数组和后缀树构建中的-障碍
Proc Annu ACM SIAM Symp Discret Algorithms. 2023;2023:5122-5202. doi: 10.1137/1.9781611977554.ch187.
7
Generalized enhanced suffix array construction in external memory.外部内存中的广义增强后缀数组构造
Algorithms Mol Biol. 2017 Dec 7;12:26. doi: 10.1186/s13015-017-0117-9. eCollection 2017.
8
Text-based multi-dimensional medical images retrieval according to the features-usage correlation.基于特征-用法相关性的文本型多维医学图像检索。
Med Biol Eng Comput. 2021 Oct;59(10):1993-2017. doi: 10.1007/s11517-021-02392-0. Epub 2021 Aug 20.
9
A space and time-efficient index for the compacted colored de Bruijn graph.一种用于压缩彩色 de Bruijn 图的空间和时间高效索引。
Bioinformatics. 2018 Jul 1;34(13):i169-i177. doi: 10.1093/bioinformatics/bty292.
10
LADER: Log-Augmented DEnse Retrieval for Biomedical Literature Search.LADER:用于生物医学文献搜索的对数增强密集检索
ArXiv. 2023 Apr 10:arXiv:2304.04590v1.

引用本文的文献

1
Lossless Pangenome Indexing Using Tag Arrays.使用标签数组的无损全基因组索引
bioRxiv. 2025 May 15:2025.05.12.653561. doi: 10.1101/2025.05.12.653561.
2
MicroRNA hsa-mir-3923 serves as a diagnostic and prognostic biomarker for gastric carcinoma.微小 RNA hsa-mir-3923 可作为胃癌的诊断和预后生物标志物。
Sci Rep. 2020 Mar 13;10(1):4672. doi: 10.1038/s41598-020-61633-8.

本文引用的文献

1
Computational pan-genomics: status, promises and challenges.计算泛基因组学:现状、前景与挑战。
Brief Bioinform. 2018 Jan 1;19(1):118-135. doi: 10.1093/bib/bbw089.
2
Big Data: Astronomical or Genomical?大数据:天文学的还是基因组学的?
PLoS Biol. 2015 Jul 7;13(7):e1002195. doi: 10.1371/journal.pbio.1002195. eCollection 2015 Jul.
3
Storage and retrieval of highly repetitive sequence collections.高度重复序列集合的存储与检索。
J Comput Biol. 2010 Mar;17(3):281-308. doi: 10.1089/cmb.2009.0169.