TetRex：一种用于高度保守基序索引加速搜索的新算法。

TetRex: a novel algorithm for index-accelerated search of highly conserved motifs.

作者信息

Schwab Remy M, Gottlieb Simon Gene, Reinert Knut

机构信息

Max-Planck-Institute for Molecular Genetics, Ihnestrasse 63, 14195 Berlin, Germany.

Algorithmic Bioinformatics, Freie Universität Berlin, Takustrasse 9, 14195 Berlin, Germany.

出版信息

NAR Genom Bioinform. 2025 Apr 17;7(2):lqaf039. doi: 10.1093/nargab/lqaf039. eCollection 2025 Jun.

DOI:10.1093/nargab/lqaf039

PMID:40248489

原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC12004226/

Abstract

The scale of modern datasets has necessitated innovations to solve even the most traditional and fundamental of computational problems. Set membership and set cardinality are both examples of simple queries that, for large enough datasets, quickly become challenging using traditional approaches. Interestingly, we find a need for these innovations within the field of biology. Despite the vast differences among living organisms, there exist functions so critical to life that they are conserved in the genomes and proteomes across seemingly unrelated species. Regular expressions (regexes) can serve as a convenient way to represent these conserved sequences compactly. However, despite the strong theoretical foundation and maturity of tools available, the state-of-the-art regex search falls short of what is necessary to quickly scan an enormous database of biological sequences. In this work, we describe a novel algorithm for regex search that reduces the search space by leveraging a recently developed compact data structure for set membership, the hierarchical interleaved bloom filter. We show that the runtime of our method combined with a traditional search outperforms state-of-the-art tools.

摘要

现代数据集的规模使得必须进行创新，以解决哪怕是最传统、最基本的计算问题。集合成员关系和集合基数都是简单查询的示例，对于足够大的数据集，使用传统方法很快就会变得具有挑战性。有趣的是，我们在生物学领域发现了对这些创新的需求。尽管生物有机体之间存在巨大差异，但仍存在对生命至关重要的功能，这些功能在看似不相关的物种的基因组和蛋白质组中得以保留。正则表达式（regexes）可以作为一种方便的方式来紧凑地表示这些保守序列。然而，尽管有强大的理论基础和可用工具的成熟度，但目前最先进的正则表达式搜索仍无法满足快速扫描庞大生物序列数据库的需求。在这项工作中，我们描述了一种新颖的正则表达式搜索算法，该算法通过利用最近开发的用于集合成员关系的紧凑数据结构——分层交错布隆过滤器来减少搜索空间。我们表明，我们的方法与传统搜索相结合的运行时性能优于目前最先进的工具。

Suppr 超能文献

文献检索

文件翻译

深度研究

Suppr 超能文献

文献检索

文件翻译

深度研究

TetRex：一种用于高度保守基序索引加速搜索的新算法。

TetRex: a novel algorithm for index-accelerated search of highly conserved motifs.

作者信息

机构信息

出版信息

相似文献

本文引用的文献

TetRex：一种用于高度保守基序索引加速搜索的新算法。

TetRex: a novel algorithm for index-accelerated search of highly conserved motifs.

作者信息

机构信息

出版信息

相似文献

本文引用的文献