Groot Koerkamp Ragnar, Liu Daniel, Pibiri Giulio Ermanno
ETH Zurich, Zurich, Switzerland.
University of California, Los Angeles, California, USA.
Algorithms Mol Biol. 2025 Mar 17;20(1):4. doi: 10.1186/s13015-025-00270-0.
Sampling algorithms that deterministically select a subset of -mers are an important building block in bioinformatics applications. For example, they are used to index large textual collections, like DNA, and to compare sequences quickly. In such applications, a sampling algorithm is required to select one -mer out of every window of w consecutive -mers. The folklore and most used scheme is the random minimizer that selects the smallest -mer in the window according to some random order. This scheme is remarkably simple and versatile, and has a density (expected fraction of selected -mers) of . In practice, lower density leads to faster methods and smaller indexes, and it turns out that the random minimizer is not the best one can do. Indeed, some schemes are known to approach optimal density 1/w when , like the recently introduced mod-minimizer (Groot Koerkamp and Pibiri, WABI 2024). In this work, we study methods that achieve low density when . In this small-k regime, a practical method with provably better density than the random minimizer is the miniception (Zheng et al., Bioinformatics 2021). This method can be elegantly described as sampling the smallest closed sycnmer (Edgar, PeerJ 2021) in the window according to some random order. We show that extending the miniception to prefer sampling open syncmers yields much better density. This new method-the open-closed minimizer-offers improved density for small while being as fast to compute as the random minimizer. Compared to methods based on decycling sets, that achieve very low density in the small-k regime, our method has comparable density while being computationally simpler and intuitive. Furthermore, we extend the mod-minimizer to improve density of any scheme that works well for small k to also work well when is large. We hence obtain the open-closed mod-minimizer, a practical method that improves over the mod-minimizer for all k.
确定性地选择k聚体子集的采样算法是生物信息学应用中的一个重要组成部分。例如,它们用于对大型文本集合(如DNA)进行索引,并快速比较序列。在这类应用中,需要一种采样算法从每w个连续k聚体的窗口中选择一个k聚体。传统且最常用的方案是随机最小化器,它根据某种随机顺序选择窗口中最小的k聚体。该方案非常简单且通用,其密度(所选k聚体的预期比例)为1/w。在实践中,较低的密度会带来更快的方法和更小的索引,事实证明随机最小化器并非最佳选择。实际上,已知一些方案在w较大时接近最优密度1/w,比如最近引入的模最小化器(Groot Koerkamp和Pibiri,WABI 2024)。在这项工作中,我们研究在w较小时实现低密度的方法。在这个小k的情况下,一种密度比随机最小化器有可证明更好表现的实用方法是小感知(Zheng等人,《生物信息学》2021年)。该方法可以简洁地描述为根据某种随机顺序在窗口中采样最小的封闭同步聚体(Edgar,《PeerJ》2021年)。我们表明,将小感知扩展为优先采样开放同步聚体可产生更好的密度。这种新方法——开闭最小化器——在k较小时提供了改进的密度,同时计算速度与随机最小化器一样快。与基于去环集的方法相比,后者在小k情况下实现了非常低的密度,我们的方法具有相当的密度,同时计算更简单且直观。此外,我们扩展了模最小化器,以提高任何在小k时效果良好的方案在k较大时的密度。因此,我们得到了开闭模最小化器,这是一种在所有k值上都优于模最小化器的实用方法。