Suppr超能文献

开闭模极小化算法。

The open-closed mod-minimizer algorithm.

作者信息

Groot Koerkamp Ragnar, Liu Daniel, Pibiri Giulio Ermanno

机构信息

ETH Zurich, Zurich, Switzerland.

University of California, Los Angeles, California, USA.

出版信息

Algorithms Mol Biol. 2025 Mar 17;20(1):4. doi: 10.1186/s13015-025-00270-0.

Abstract

Sampling algorithms that deterministically select a subset of -mers are an important building block in bioinformatics applications. For example, they are used to index large textual collections, like DNA, and to compare sequences quickly. In such applications, a sampling algorithm is required to select one -mer out of every window of w consecutive -mers. The folklore and most used scheme is the random minimizer that selects the smallest -mer in the window according to some random order. This scheme is remarkably simple and versatile, and has a density (expected fraction of selected -mers) of . In practice, lower density leads to faster methods and smaller indexes, and it turns out that the random minimizer is not the best one can do. Indeed, some schemes are known to approach optimal density 1/w when , like the recently introduced mod-minimizer (Groot Koerkamp and Pibiri, WABI 2024). In this work, we study methods that achieve low density when . In this small-k regime, a practical method with provably better density than the random minimizer is the miniception (Zheng et al., Bioinformatics 2021). This method can be elegantly described as sampling the smallest closed sycnmer (Edgar, PeerJ 2021) in the window according to some random order. We show that extending the miniception to prefer sampling open syncmers yields much better density. This new method-the open-closed minimizer-offers improved density for small while being as fast to compute as the random minimizer. Compared to methods based on decycling sets, that achieve very low density in the small-k regime, our method has comparable density while being computationally simpler and intuitive. Furthermore, we extend the mod-minimizer to improve density of any scheme that works well for small k to also work well when is large. We hence obtain the open-closed mod-minimizer, a practical method that improves over the mod-minimizer for all k.

摘要

确定性地选择k聚体子集的采样算法是生物信息学应用中的一个重要组成部分。例如,它们用于对大型文本集合(如DNA)进行索引,并快速比较序列。在这类应用中,需要一种采样算法从每w个连续k聚体的窗口中选择一个k聚体。传统且最常用的方案是随机最小化器,它根据某种随机顺序选择窗口中最小的k聚体。该方案非常简单且通用,其密度(所选k聚体的预期比例)为1/w。在实践中,较低的密度会带来更快的方法和更小的索引,事实证明随机最小化器并非最佳选择。实际上,已知一些方案在w较大时接近最优密度1/w,比如最近引入的模最小化器(Groot Koerkamp和Pibiri,WABI 2024)。在这项工作中,我们研究在w较小时实现低密度的方法。在这个小k的情况下,一种密度比随机最小化器有可证明更好表现的实用方法是小感知(Zheng等人,《生物信息学》2021年)。该方法可以简洁地描述为根据某种随机顺序在窗口中采样最小的封闭同步聚体(Edgar,《PeerJ》2021年)。我们表明,将小感知扩展为优先采样开放同步聚体可产生更好的密度。这种新方法——开闭最小化器——在k较小时提供了改进的密度,同时计算速度与随机最小化器一样快。与基于去环集的方法相比,后者在小k情况下实现了非常低的密度,我们的方法具有相当的密度,同时计算更简单且直观。此外,我们扩展了模最小化器,以提高任何在小k时效果良好的方案在k较大时的密度。因此,我们得到了开闭模最小化器,这是一种在所有k值上都优于模最小化器的实用方法。

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验