针对大k值的k-mer字典进行节省空间的计算。

Díaz-Domínguez Diego, Leinonen Miika, Salmela Leena

Department of Computer Science, University of Helsinki, Pietari Kalmin katu 5, 00014, Helsinki, Finland.

Algorithms Mol Biol. 2024 Apr 5;19(1):14. doi: 10.1186/s13015-024-00259-1.

Computing k-mer frequencies in a collection of reads is a common procedure in many genomic applications. Several state-of-the-art k-mer counters rely on hash tables to carry out this task but they are often optimised for small k as a hash table keeping keys explicitly (i.e., k-mer sequences) takes computer words, N being the number of distinct k-mers and w the computer word size, which is impractical for long values of k. This space usage is an important limitation as analysis of long and accurate HiFi sequencing reads can require larger values of k. We propose Kaarme, a space-efficient hash table for k-mers using words of space, where u is the number of reads. Our framework exploits the fact that consecutive k-mers overlap by symbols. Thus, we only store the last symbol of a k-mer and a pointer within the hash table to a previous one, which we can use to recover the remaining symbols. We adapt Kaarme to compute canonical k-mers as well. This variant also uses pointers within the hash table to save space but requires more work to decode the k-mers. Specifically, it takes time in the worst case, being the DNA alphabet, but our experiments show this is hardly ever the case. The canonical variant does not improve our theoretical results but greatly reduces space usage in practice while keeping a competitive performance to get the k-mers and their frequencies. We compare canonical Kaarme to a regular hash table storing canonical k-mers explicitly as keys and show that our method uses up to five times less space while being less than 1.5 times slower. We also show that canonical Kaarme uses significantly less memory than state-of-the-art k-mer counters when they do not resort to disk to keep intermediate results.

在许多基因组应用中，计算一组读段中的k-mer频率是一个常见的过程。一些先进的k-mer计数器依靠哈希表来执行此任务，但它们通常针对小的k值进行了优化，因为显式保存键（即k-mer序列）的哈希表需要计算机字，其中N是不同k-mer的数量，w是计算机字大小，对于较大的k值来说这是不切实际的。这种空间使用是一个重要的限制，因为对长且准确的HiFi测序读段进行分析可能需要更大的k值。我们提出了Kaarme，一种用于k-mer的空间高效哈希表，它使用字的空间，其中u是读段的数量。我们的框架利用了连续k-mer重叠个符号这一事实。因此，我们只存储k-mer的最后一个符号以及哈希表中指向先前k-mer的指针，我们可以使用该指针来恢复其余个符号。我们还对Kaarme进行了调整，以计算规范k-mer。这个变体同样利用哈希表中的指针来节省空间，但解码k-mer需要更多工作。具体来说，在最坏情况下需要时间，是DNA字母表，但我们的实验表明这种情况几乎不会发生。规范变体并没有改善我们的理论结果，但在实际应用中大大减少了空间使用，同时在获取k-mer及其频率方面保持了有竞争力的性能。我们将规范Kaarme与将规范k-mer作为键显式存储的常规哈希表进行了比较，结果表明我们的方法使用的空间减少了多达五倍，而速度慢不到1.5倍。我们还表明，当现有最先进的k-mer计数器不借助磁盘来保存中间结果时，规范Kaarme使用的内存要少得多。