Department of Human Genetics, The University of Chicago, Chicago, IL 60637, USA.
BMC Bioinformatics. 2011 Aug 10;12:333. doi: 10.1186/1471-2105-12-333.
Counting k-mers (substrings of length k in DNA sequence data) is an essential component of many methods in bioinformatics, including for genome and transcriptome assembly, for metagenomic sequencing, and for error correction of sequence reads. Although simple in principle, counting k-mers in large modern sequence data sets can easily overwhelm the memory capacity of standard computers. In current data sets, a large fraction-often more than 50%-of the storage capacity may be spent on storing k-mers that contain sequencing errors and which are typically observed only a single time in the data. These singleton k-mers are uninformative for many algorithms without some kind of error correction.
We present a new method that identifies all the k-mers that occur more than once in a DNA sequence data set. Our method does this using a Bloom filter, a probabilistic data structure that stores all the observed k-mers implicitly in memory with greatly reduced memory requirements. We then make a second sweep through the data to provide exact counts of all nonunique k-mers. For example data sets, we report up to 50% savings in memory usage compared to current software, with modest costs in computational speed. This approach may reduce memory requirements for any algorithm that starts by counting k-mers in sequence data with errors.
A reference implementation for this methodology, BFCounter, is written in C++ and is GPL licensed. It is available for free download at http://pritch.bsd.uchicago.edu/bfcounter.html.
计数 k-mer(DNA 序列数据中长度为 k 的子字符串)是生物信息学中许多方法的基本组成部分,包括基因组和转录组组装、宏基因组测序以及序列读取的错误纠正。尽管原则上很简单,但在大型现代序列数据集上计数 k-mer 很容易超出标准计算机的内存容量。在当前的数据集,大量的存储容量-通常超过 50%-可能用于存储包含测序错误的 k-mer,这些 k-mer 通常在数据中只观察到一次。这些单例 k-mer 对于许多没有某种错误纠正的算法来说是无信息的。
我们提出了一种新的方法,可以识别 DNA 序列数据集中出现多次的所有 k-mer。我们的方法使用布隆过滤器(一种概率数据结构)来实现这一点,该结构使用内存隐式存储所有观察到的 k-mer,从而大大减少了内存需求。然后,我们再次遍历数据,提供所有非唯一 k-mer 的精确计数。例如,对于数据集,与当前软件相比,我们报告内存使用量节省了高达 50%,而计算速度的成本适中。对于任何从具有错误的序列数据开始计数 k-mer 的算法,这种方法都可以减少内存需求。
这种方法的参考实现,BFCounter,是用 C++编写的,并且是 GPL 许可的。它可在 http://pritch.bsd.uchicago.edu/bfcounter.html 免费下载。