Algorizk, 75013 Paris, France.
Bioinformatics. 2013 Mar 1;29(5):652-3. doi: 10.1093/bioinformatics/btt020. Epub 2013 Jan 16.
Counting all the k-mers (substrings of length k) in DNA/RNA sequencing reads is the preliminary step of many bioinformatics applications. However, state of the art k-mer counting methods require that a large data structure resides in memory. Such structure typically grows with the number of distinct k-mers to count. We present a new streaming algorithm for k-mer counting, called DSK (disk streaming of k-mers), which only requires a fixed user-defined amount of memory and disk space. This approach realizes a memory, time and disk trade-off. The multi-set of all k-mers present in the reads is partitioned, and partitions are saved to disk. Then, each partition is separately loaded in memory in a temporary hash table. The k-mer counts are returned by traversing each hash table. Low-abundance k-mers are optionally filtered. DSK is the first approach that is able to count all the 27-mers of a human genome dataset using only 4.0 GB of memory and moderate disk space (160 GB), in 17.9 h. DSK can replace a popular k-mer counting software (Jellyfish) on small-memory servers.
在 DNA/RNA 测序reads 中计算所有的 k-mer(长度为 k 的子字符串)是许多生物信息学应用的初步步骤。然而,最先进的 k-mer 计数方法要求大量的数据结构驻留在内存中。这种结构通常随着要计数的不同 k-mer 的数量而增长。我们提出了一种新的用于 k-mer 计数的流式算法,称为 DSK(k-mer 的磁盘流),它只需要固定的用户定义的内存和磁盘空间。这种方法实现了内存、时间和磁盘之间的权衡。在读取中出现的所有 k-mer 的多集被分区,并将分区保存到磁盘。然后,每个分区分别在内存中的临时哈希表中加载。通过遍历每个哈希表返回 k-mer 计数。可选地过滤低丰度的 k-mer。DSK 是第一个能够仅使用 4.0GB 内存和适度磁盘空间(160GB)来计算人类基因组数据集的所有 27-mer 的方法,耗时 17.9 小时。DSK 可以在小内存服务器上替代流行的 k-mer 计数软件(Jellyfish)。