A near-tight lower bound on the density of forward sampling schemes.

Authors

Kille Bryce, Groot Koerkamp Ragnar, McAdams Drake, Liu Alan, Treangen Todd J

Affiliations

Department of Computer Science, Rice University, Houston, TX 77005, United States.

Department of Computer Science, ETH Zurich, Zurich 8092, Switzerland.

Publication information

Bioinformatics. 2024 Dec 26;41(1). doi: 10.1093/bioinformatics/btae736.

DOI:10.1093/bioinformatics/btae736

PMID:39666942

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11676336/

Abstract

MOTIVATION

Sampling k-mers is a ubiquitous task in sequence analysis algorithms. Sampling schemes such as the often-used random minimizer scheme are particularly appealing as they guarantee at least one k-mer is selected out of every w consecutive k-mers. Sampling fewer k-mers often leads to an increase in efficiency of downstream methods. Thus, developing schemes that have low density, i.e. have a small proportion of sampled k-mers, is an active area of research. After over a decade of consistent efforts in both decreasing the density of practical schemes and increasing the lower bound on the best possible density, there is still a large gap between the two.
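To make the window guarantee and the notion of density concrete, here is a minimal sketch of a random minimizer in Python. It is an illustration only, not the authors' implementation: the hash-based ordering (`kmer_order`), the function names, and the random test sequence are all assumptions introduced here.

```python
import hashlib
import random


def kmer_order(kmer: str) -> int:
    """Pseudo-random rank of a k-mer (an assumed stand-in for a random order)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def random_minimizer_positions(seq: str, w: int, k: int) -> list[int]:
    """Start positions of the k-mers sampled by a random minimizer."""
    n_kmers = len(seq) - k + 1
    ranks = [kmer_order(seq[i:i + k]) for i in range(n_kmers)]
    sampled = set()
    # Slide over every window of w consecutive k-mers and keep the
    # smallest-ranked k-mer in it (leftmost on ties).
    for start in range(n_kmers - w + 1):
        sampled.add(min(range(start, start + w), key=lambda i: (ranks[i], i)))
    return sorted(sampled)


def density(seq: str, w: int, k: int) -> float:
    """Fraction of all k-mers in seq that are sampled."""
    return len(random_minimizer_positions(seq, w, k)) / (len(seq) - k + 1)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the demo is repeatable
    seq = "".join(rng.choice("ACGT") for _ in range(5000))
    print(f"density at w=19, k=19: {density(seq, 19, 19):.4f}")
```

Because every window contributes its minimum, consecutive sampled positions can never be more than w apart, which is the guarantee described above. The density of a random minimizer is commonly cited as roughly 2/(w+1), and no scheme with this guarantee can go below 1/w.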

RESULTS

We prove a near-tight lower bound on the density of forward sampling schemes, a class of schemes that generalizes minimizer schemes. For small w and k, we observe that our bound is tight when k ≡ 1 (mod w). For large w and k, the bound can be approximated by $\frac{1}{w+k}\left\lceil\frac{w+k}{w}\right\rceil$. Importantly, our lower bound implies that existing schemes are much closer to achieving optimal density than previously known. For example, with the current default minimap2 HiFi settings w = 19 and k = 19, we show that the best known scheme for these parameters, the double decycling-set-based minimizer of Pellow et al. is at most 3% denser than optimal, compared to the previous gap of at most 50%. Furthermore, when k ≡ 1 (mod w) and the alphabet size σ goes to ∞, we show that mod-minimizers introduced by Groot Koerkamp and Pibiri achieve optimal density matching our lower bound.
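As a rough numerical illustration, the large-parameter approximation quoted above, read as ⌈(w+k)/w⌉ / (w+k), can be evaluated exactly with rational arithmetic. This sketch covers only that approximate expression; the exact bound proved in the paper is not reproduced here, and the function name `approx_lower_bound` is an assumption.

```python
from fractions import Fraction


def approx_lower_bound(w: int, k: int) -> Fraction:
    """Evaluate ceil((w + k) / w) / (w + k) as an exact fraction."""
    ceil_term = -(-(w + k) // w)  # integer ceiling of (w + k) / w
    return Fraction(ceil_term, w + k)


if __name__ == "__main__":
    for w, k in [(19, 19), (19, 20), (5, 6)]:
        value = approx_lower_bound(w, k)
        print(f"w={w}, k={k}: {value} = {float(value):.4f}")
```

The ceiling term increases by one exactly when k ≡ 1 (mod w), for instance going from k = 19 to k = 20 at w = 19, so the expression is largest relative to 1/w at those values of k. Since ⌈(w+k)/w⌉ ≥ (w+k)/w, the expression is never below 1/w.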

AVAILABILITY AND IMPLEMENTATION

Minimizer implementations: github.com/RagnarGrootKoerkamp/minimizers. ILP and analysis: github.com/treangenlab/sampling-scheme-analysis.

Similar articles

A near-tight lower bound on the density of forward sampling schemes.

bioRxiv. 2024 Nov 19:2024.09.06.611668. doi: 10.1101/2024.09.06.611668.

The open-closed mod-minimizer algorithm.

Algorithms Mol Biol. 2025 Mar 17;20(1):4. doi: 10.1186/s13015-025-00270-0.

Improved design and analysis of practical minimizers.

Bioinformatics. 2020 Jul 1;36(Suppl_1):i119-i127. doi: 10.1093/bioinformatics/btaa472.

Asymptotically optimal minimizers schemes.

Bioinformatics. 2018 Jul 1;34(13):i13-i22. doi: 10.1093/bioinformatics/bty258.

Efficient minimizer orders for large values of k using minimum decycling sets.

Genome Res. 2023 Jul;33(7):1154-1161. doi: 10.1101/gr.277644.123. Epub 2023 Aug 9.

A simple refined DNA minimizer operator enables 2-fold faster computation.

Bioinformatics. 2024 Feb 1;40(2). doi: 10.1093/bioinformatics/btae045.

Syncmers are more sensitive than minimizers for selecting conserved k-mers in biological sequences.

PeerJ. 2021 Feb 5;9:e10805. doi: 10.7717/peerj.10805. eCollection 2021.

Weighted minimizer sampling improves long read mapping.

Bioinformatics. 2020 Jul 1;36(Suppl_1):i111-i118. doi: 10.1093/bioinformatics/btaa435.

Differentiable Learning of Sequence-Specific Minimizer Schemes with DeepMinimizer.

J Comput Biol. 2022 Dec;29(12):1288-1304. doi: 10.1089/cmb.2022.0275. Epub 2022 Sep 12.

Cited by

GreedyMini: generating low-density DNA minimizers.

Bioinformatics. 2025 Jul 1;41(Supplement_1):i275-i284. doi: 10.1093/bioinformatics/btaf251.

Fast and flexible minimizer digestion with digest.

Bioinformatics. 2025 Jul 1;41(7). doi: 10.1093/bioinformatics/btaf368.

The open-closed mod-minimizer algorithm.

Algorithms Mol Biol. 2025 Mar 17;20(1):4. doi: 10.1186/s13015-025-00270-0.

Fast and flexible minimizer digestion with digest.

bioRxiv. 2025 Jan 8:2025.01.02.631161. doi: 10.1101/2025.01.02.631161.

References

When less is more: sketching with minimizers in genomics.

Genome Biol. 2024 Oct 14;25(1):270. doi: 10.1186/s13059-024-03414-4.

A Randomized Parallel Algorithm for Efficiently Finding Near-Optimal Universal Hitting Sets.

Res Comput Mol Biol. 2020 May;12074:37-53. doi: 10.1007/978-3-030-45257-5_3. Epub 2020 Apr 21.

Creating and Using Minimizer Sketches in Computational Genomics.

J Comput Biol. 2023 Dec;30(12):1251-1276. doi: 10.1089/cmb.2023.0094. Epub 2023 Aug 30.

The complete sequence of a human Y chromosome.

Nature. 2023 Sep;621(7978):344-354. doi: 10.1038/s41586-023-06457-y. Epub 2023 Aug 23.

Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation.

Bioinformatics. 2023 Sep 2;39(9). doi: 10.1093/bioinformatics/btad512.

Efficient minimizer orders for large values of k using minimum decycling sets.

Genome Res. 2023 Jul;33(7):1154-1161. doi: 10.1101/gr.277644.123. Epub 2023 Aug 9.

Theory of local k-mer selection with applications to long-read alignment.

Bioinformatics. 2022 Oct 14;38(20):4659-4669. doi: 10.1093/bioinformatics/btab790.

Differentiable Learning of Sequence-Specific Minimizer Schemes with DeepMinimizer.

J Comput Biol. 2022 Dec;29(12):1288-1304. doi: 10.1089/cmb.2022.0275. Epub 2022 Sep 12.

Sparse and skew hashing of K-mers.

Bioinformatics. 2022 Jun 24;38(Suppl 1):i185-i194. doi: 10.1093/bioinformatics/btac245.

Sequence-specific minimizers via polar sets.

Bioinformatics. 2021 Jul 12;37(Suppl_1):i187-i195. doi: 10.1093/bioinformatics/btab313.
