Golan Shay, Tziony Ido, Kraus Matan, Orenstein Yaron, Shur Arseny
Department of Computer Science, University of Haifa, Haifa 3498838, Israel.
Efi Arazi School of Computer Science, Reichman University, Herzliya 4610101, Israel.
Bioinformatics. 2025 Jul 1;41(Supplement_1):i275-i284. doi: 10.1093/bioinformatics/btaf251.
Minimizers are the most popular k-mer selection scheme in algorithms and data structures analyzing high-throughput sequencing (HTS) data. In a minimizer scheme, the smallest k-mer by some predefined order is selected as the representative of a sequence window containing w consecutive k-mers, which results in overlapping windows often selecting the same k-mer. Minimizers that achieve the lowest frequency of selected k-mers over a random DNA sequence, termed the expected density, are desired for improved performance of HTS analyses. Yet, no method to date exists to generate minimizers that achieve minimum expected density. Moreover, for k and w values used by common HTS algorithms and data structures, there is a gap between densities achieved by existing selection schemes and the theoretical lower bound.
We developed GreedyMini, a toolkit of methods to generate minimizers with low expected or particular density, to improve minimizers, to extend minimizers to larger alphabets, k, and w, and to measure the expected density of a given minimizer efficiently. We demonstrate over various combinations of k and w values, including those of popular HTS methods, that GreedyMini can generate DNA minimizers that achieve expected densities very close to the lower bound, and both expected and particular densities much lower compared to existing selection schemes. Moreover, we show that GreedyMini's k-mer rank-retrieval time is comparable to common k-mer hash functions. We expect GreedyMini to improve the performance of many HTS algorithms and data structures and advance the research of k-mer selection schemes.
The toolkit, its source code, and precomputed minimizers for a variety of (k,w) pairs are available via https://github.com/OrensteinLab/GreedyMini.
在分析高通量测序(HTS)数据的算法和数据结构中,最小化子是最流行的k-mer选择方案。在最小化子方案中,按某种预定义顺序选择的最小k-mer被选作包含w个连续k-mer的序列窗口的代表,这导致重叠窗口经常选择相同的k-mer。对于改进HTS分析的性能而言,期望在随机DNA序列上实现所选k-mer最低频率(称为期望密度)的最小化子。然而,迄今为止还没有方法来生成具有最小期望密度的最小化子。此外,对于常见HTS算法和数据结构所使用的k和w值,现有选择方案所实现的密度与理论下限之间存在差距。
我们开发了GreedyMini,这是一个用于生成具有低期望密度或特定密度的最小化子、改进最小化子、将最小化子扩展到更大字母表、k和w以及有效测量给定最小化子期望密度的方法工具包。我们通过各种k和w值的组合(包括流行HTS方法的组合)证明,GreedyMini可以生成期望密度非常接近下限的DNA最小化子,并且与现有选择方案相比,期望密度和特定密度都要低得多。此外,我们表明GreedyMini 的k-mer排名检索时间与常见的k-mer哈希函数相当。我们期望GreedyMini能提高许多HTS算法和数据结构的性能,并推动k-mer选择方案的研究。
该工具包、其源代码以及针对各种(k,w)对的预计算最小化子可通过https://github.com/OrensteinLab/GreedyMini获得。