Schwab Remy M, Gottlieb Simon Gene, Reinert Knut
Max-Planck-Institute for Molecular Genetics, Ihnestrasse 63, 14195 Berlin, Germany.
Algorithmic Bioinformatics, Freie Universität Berlin, Takustrasse 9, 14195 Berlin, Germany.
NAR Genom Bioinform. 2025 Apr 17;7(2):lqaf039. doi: 10.1093/nargab/lqaf039. eCollection 2025 Jun.
The scale of modern datasets has necessitated innovations to solve even the most traditional and fundamental of computational problems. Set membership and set cardinality are both examples of simple queries that, for large enough datasets, quickly become challenging using traditional approaches. Interestingly, we find a need for these innovations within the field of biology. Despite the vast differences among living organisms, there exist functions so critical to life that they are conserved in the genomes and proteomes across seemingly unrelated species. Regular expressions (regexes) can serve as a convenient way to represent these conserved sequences compactly. However, despite the strong theoretical foundation and maturity of tools available, the state-of-the-art regex search falls short of what is necessary to quickly scan an enormous database of biological sequences. In this work, we describe a novel algorithm for regex search that reduces the search space by leveraging a recently developed compact data structure for set membership, the hierarchical interleaved bloom filter. We show that the runtime of our method combined with a traditional search outperforms state-of-the-art tools.
现代数据集的规模使得必须进行创新,以解决哪怕是最传统、最基本的计算问题。集合成员关系和集合基数都是简单查询的示例,对于足够大的数据集,使用传统方法很快就会变得具有挑战性。有趣的是,我们在生物学领域发现了对这些创新的需求。尽管生物有机体之间存在巨大差异,但仍存在对生命至关重要的功能,这些功能在看似不相关的物种的基因组和蛋白质组中得以保留。正则表达式(regexes)可以作为一种方便的方式来紧凑地表示这些保守序列。然而,尽管有强大的理论基础和可用工具的成熟度,但目前最先进的正则表达式搜索仍无法满足快速扫描庞大生物序列数据库的需求。在这项工作中,我们描述了一种新颖的正则表达式搜索算法,该算法通过利用最近开发的用于集合成员关系的紧凑数据结构——分层交错布隆过滤器来减少搜索空间。我们表明,我们的方法与传统搜索相结合的运行时性能优于目前最先进的工具。