Suppr超能文献

前向采样方案密度的一个近乎紧密的下界。

A near-tight lower bound on the density of forward sampling schemes.

作者信息

Kille Bryce, Groot Koerkamp Ragnar, McAdams Drake, Liu Alan, Treangen Todd J

机构信息

Department of Computer Science, Rice University, Houston, TX 77005, United States.

Department of Computer Science, ETH Zurich, Zurich 8092, Switzerland.

出版信息

Bioinformatics. 2024 Dec 26;41(1). doi: 10.1093/bioinformatics/btae736.

Abstract

MOTIVATION

Sampling k-mers is a ubiquitous task in sequence analysis algorithms. Sampling schemes such as the often-used random minimizer scheme are particularly appealing as they guarantee at least one k-mer is selected out of every w consecutive k-mers. Sampling fewer k-mers often leads to an increase in efficiency of downstream methods. Thus, developing schemes that have low density, i.e. have a small proportion of sampled k-mers, is an active area of research. After over a decade of consistent efforts in both decreasing the density of practical schemes and increasing the lower bound on the best possible density, there is still a large gap between the two.

RESULTS

We prove a near-tight lower bound on the density of forward sampling schemes, a class of schemes that generalizes minimizer schemes. For small w and k, we observe that our bound is tight when k≡1(mod w). For large w and k, the bound can be approximated by 1w+k⌈w+kw⌉. Importantly, our lower bound implies that existing schemes are much closer to achieving optimal density than previously known. For example, with the current default minimap2 HiFi settings w = 19 and k = 19, we show that the best known scheme for these parameters, the double decycling-set-based minimizer of Pellow et al. is at most 3% denser than optimal, compared to the previous gap of at most 50%. Furthermore, when k≡1(mod w) and the alphabet size σ goes to ∞, we show that mod-minimizers introduced by Groot Koerkamp and Pibiri achieve optimal density matching our lower bound.

AVAILABILITY AND IMPLEMENTATION

Minimizer implementations: github.com/RagnarGrootKoerkamp/minimizers ILP and analysis: github.com/treangenlab/sampling-scheme-analysis.

摘要

动机

在序列分析算法中,对k-mer进行采样是一项普遍存在的任务。诸如常用的随机最小化器方案之类的采样方案特别有吸引力,因为它们保证从每w个连续的k-mer中至少选择一个k-mer。采样较少的k-mer通常会提高下游方法的效率。因此,开发具有低密度(即采样的k-mer比例小)的方案是一个活跃的研究领域。经过十多年在降低实际方案密度和提高最佳可能密度下限方面的持续努力,两者之间仍存在很大差距。

结果

我们证明了前向采样方案密度的一个近乎紧密的下限,前向采样方案是一类推广了最小化器方案的方案。对于较小的w和k,我们观察到当k≡1(mod w)时,我们的下限是紧密的。对于较大的w和k,该下限可以近似为1 / (w + k⌈(w + k) / w⌉)。重要的是,我们的下限意味着现有方案比以前已知的方案更接近实现最优密度。例如,在当前默认的minimap2 HiFi设置w = 19和k = 19下,我们表明对于这些参数,最著名的方案,即Pellow等人基于双去环集的最小化器,比最优方案最多密集3%,而之前的差距最多为50%。此外,当k≡1(mod w)且字母表大小σ趋于无穷大时,我们表明Groot Koerkamp和Pibiri引入的模最小化器实现了与我们下限匹配的最优密度。

可用性和实现

最小化器实现:github.com/RagnarGrootKoerkamp/minimizers 整数线性规划和分析:github.com/treangenlab/sampling-scheme-analysis

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f40d/11676336/aa1a51260f3c/btae736f1.jpg

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验