TetRex: a novel algorithm for index-accelerated search of highly conserved motifs.

Authors

Schwab Remy M, Gottlieb Simon Gene, Reinert Knut

Affiliations

Max-Planck-Institute for Molecular Genetics, Ihnestrasse 63, 14195 Berlin, Germany.

Algorithmic Bioinformatics, Freie Universität Berlin, Takustrasse 9, 14195 Berlin, Germany.

Publication information

NAR Genom Bioinform. 2025 Apr 17;7(2):lqaf039. doi: 10.1093/nargab/lqaf039. eCollection 2025 Jun.

DOI:10.1093/nargab/lqaf039
PMID:40248489
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12004226/
Abstract

The scale of modern datasets has necessitated innovations to solve even the most traditional and fundamental of computational problems. Set membership and set cardinality are both examples of simple queries that, for large enough datasets, quickly become challenging using traditional approaches. Interestingly, we find a need for these innovations within the field of biology. Despite the vast differences among living organisms, there exist functions so critical to life that they are conserved in the genomes and proteomes across seemingly unrelated species. Regular expressions (regexes) can serve as a convenient way to represent these conserved sequences compactly. However, despite the strong theoretical foundation and maturity of tools available, the state-of-the-art regex search falls short of what is necessary to quickly scan an enormous database of biological sequences. In this work, we describe a novel algorithm for regex search that reduces the search space by leveraging a recently developed compact data structure for set membership, the hierarchical interleaved bloom filter. We show that the runtime of our method combined with a traditional search outperforms state-of-the-art tools.
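
The general idea of shrinking the search space with a set-membership filter before running a conventional regex scan can be illustrated with a small sketch. This is not the TetRex implementation: it swaps the hierarchical interleaved Bloom filter for one plain Bloom filter per sequence bin, and it only handles motifs written as a list of allowed residues per position. The names `BloomFilter`, `build_index`, `may_contain`, `search`, the k-mer length `K`, and the example bins are all assumptions made for illustration.

```python
import hashlib
import re
from itertools import product

K = 3  # k-mer length; an illustrative choice, not a value from the paper


class BloomFilter:
    """Plain Bloom filter: may report false positives, never false negatives."""

    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive n_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def build_index(bins, k=K):
    """Store every k-mer of every bin in that bin's own Bloom filter."""
    index = {}
    for name, seq in bins.items():
        bf = BloomFilter()
        for i in range(len(seq) - k + 1):
            bf.add(seq[i:i + k])
        index[name] = bf
    return index


def window_alternatives(motif, k=K):
    """For each window of k motif positions, list every k-mer it allows."""
    return [
        {"".join(p) for p in product(*motif[i:i + k])}
        for i in range(len(motif) - k + 1)
    ]


def may_contain(bloom, motif, k=K):
    """Necessary condition for a match: every window has at least one
    of its alternative k-mers present in the filter (AND over OR)."""
    return all(
        any(kmer in bloom for kmer in alts)
        for alts in window_alternatives(motif, k)
    )


def search(bins, index, motif):
    """Run the regex only on bins that pass the Bloom-filter check."""
    pattern = re.compile("".join(f"[{allowed}]" for allowed in motif))
    hits = {}
    for name, seq in bins.items():
        if not may_contain(index[name], motif):
            continue  # bin cannot contain the motif, skip the scan
        starts = [m.start() for m in pattern.finditer(seq)]
        if starts:
            hits[name] = starts
    return hits


# Hypothetical toy data: the motif A[CG]TT, one allowed-residue string per position.
bins = {"bin0": "GGGACTTGGG", "bin1": "CCCCCCCC", "bin2": "TTAGTTAA"}
motif = ["A", "CG", "T", "T"]
index = build_index(bins)
print(search(bins, index, motif))
```

Because a Bloom filter never produces false negatives, the prefilter cannot discard a bin that holds a true match. False positives only cost an unnecessary scan, which the final regex pass corrects.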

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c879/12004226/3440bb53514e/lqaf039fig1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c879/12004226/6e92d83ff259/lqaf039fig2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c879/12004226/bf671bdc4174/lqaf039fig3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c879/12004226/a69576a1a280/lqaf039fig4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c879/12004226/888b3809ef35/lqaf039fig5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c879/12004226/fb9cb4a31ac2/lqaf039fig6.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c879/12004226/b52ff000c1d4/lqaf039fig7.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c879/12004226/254891738cf2/lqaf039fig8.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c879/12004226/ada1e53df4c8/lqaf039fig9.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c879/12004226/9be51593dad2/lqaf039fig10.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c879/12004226/400288c1ce8a/lqaf039fig11.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c879/12004226/3362c9c5d20d/lqaf039fig12.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c879/12004226/a9f4ca9bd49d/lqaf039fig13.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c879/12004226/d7347a69de50/lqaf039fig14.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c879/12004226/fe9278fefc41/lqaf039fig15.jpg
