DSK：使用极低内存进行 k-mer 计数。

DSK: k-mer counting with very low memory usage.

机构信息

Algorizk, 75013 Paris, France.

出版信息

Bioinformatics. 2013 Mar 1;29(5):652-3. doi: 10.1093/bioinformatics/btt020. Epub 2013 Jan 16.

DOI:10.1093/bioinformatics/btt020

PMID:23325618

Abstract

SUMMARY

Counting all the k-mers (substrings of length k) in DNA/RNA sequencing reads is the preliminary step of many bioinformatics applications. However, state of the art k-mer counting methods require that a large data structure resides in memory. Such structure typically grows with the number of distinct k-mers to count. We present a new streaming algorithm for k-mer counting, called DSK (disk streaming of k-mers), which only requires a fixed user-defined amount of memory and disk space. This approach realizes a memory, time and disk trade-off. The multi-set of all k-mers present in the reads is partitioned, and partitions are saved to disk. Then, each partition is separately loaded in memory in a temporary hash table. The k-mer counts are returned by traversing each hash table. Low-abundance k-mers are optionally filtered. DSK is the first approach that is able to count all the 27-mers of a human genome dataset using only 4.0 GB of memory and moderate disk space (160 GB), in 17.9 h. DSK can replace a popular k-mer counting software (Jellyfish) on small-memory servers.

AVAILABILITY

http://minia.genouest.org/dsk

摘要

在 DNA/RNA 测序reads 中计算所有的 k-mer（长度为 k 的子字符串）是许多生物信息学应用的初步步骤。然而，最先进的 k-mer 计数方法要求大量的数据结构驻留在内存中。这种结构通常随着要计数的不同 k-mer 的数量而增长。我们提出了一种新的用于 k-mer 计数的流式算法，称为 DSK（k-mer 的磁盘流），它只需要固定的用户定义的内存和磁盘空间。这种方法实现了内存、时间和磁盘之间的权衡。在读取中出现的所有 k-mer 的多集被分区，并将分区保存到磁盘。然后，每个分区分别在内存中的临时哈希表中加载。通过遍历每个哈希表返回 k-mer 计数。可选地过滤低丰度的 k-mer。DSK 是第一个能够仅使用 4.0GB 内存和适度磁盘空间（160GB）来计算人类基因组数据集的所有 27-mer 的方法，耗时 17.9 小时。DSK 可以在小内存服务器上替代流行的 k-mer 计数软件（Jellyfish）。