Levallois Victor, Shibuya Yoshihiro, Le Gal Bertrand, Patro Rob, Peterlongo Pierre, Ermanno Pibiri Giulio
GenScale, University of Rennes, Inria, CNRS, IRISA - UMR 6074, Rennes, France.
Sequence Bioinformatics Unit, Institut Pasteur, F-75015, Paris, France.
bioRxiv. 2025 May 21:2025.05.16.654317. doi: 10.1101/2025.05.16.654317.
The problem of identifying the set of textual documents from a given database containing a query string has been studied in various fields of computing, e.g., in Information Retrieval, Databases, and Computational Biology. We consider the approximate version of this problem, that is, the result set is allowed to contain some false positive matches (but no false negatives), and focus on the specific case where the indexed documents are DNA strings. In this setting, state-of-the-art solutions rely on Bloom filters as a way to index all -mers (substrings of length ) in the documents. To answer a query, the -mers of the query string are tested for membership against the index and documents that contain at least a user-prescribed fraction of them (e.g., 75-80%) are returned.
Here, we explore an alternative index design based on -mer minimizers and integer compression methods. We show that a careful implementation of this design outperforms previous solutions based on Bloom filters by a wide margin: the index has lower memory footprint and faster query times, while false positive matches have only a minor impact on the ranking of the documents reported. This trend is robust across genomic datasets of different complexity and query workloads.
The software is implemented in C++17 and available under the MIT license at github.com/yhhshb/kaminari. Reproducibility information and additional results are provided at github.com/vicLeva/benchmarks_kaminari.
在计算的各个领域,如信息检索、数据库和计算生物学中,都研究了从给定数据库中识别包含查询字符串的文本文件集的问题。我们考虑这个问题的近似版本,即结果集允许包含一些误报匹配(但不包含漏报),并专注于索引文档为DNA字符串的特定情况。在这种情况下,目前的先进解决方案依赖布隆过滤器来索引文档中的所有k - 聚体(长度为k的子串)。为了回答查询,会针对索引测试查询字符串的k - 聚体的成员资格,并返回包含至少用户规定比例(例如75 - 80%)的k - 聚体的文档。
在这里,我们探索了一种基于k - 聚体最小化器和整数压缩方法的替代索引设计。我们表明,这种设计的精心实现比基于布隆过滤器的先前解决方案有很大优势:索引占用的内存更少,查询时间更快,而误报匹配对报告的文档排名只有轻微影响。这种趋势在不同复杂度的基因组数据集和查询工作负载中都很稳健。
该软件用C++17实现,可在github.com/yhhshb/kaminari上根据MIT许可获取。在github.com/vicLeva/benchmarks_kaminari上提供了可重复性信息和其他结果。