Department of Computer Science, University of California, Davis, CA, USA.
Department of Population Health and Reproduction, School of Veterinary Medicine, University of California, Davis, CA, USA.
BMC Bioinformatics. 2021 Feb 16;22(1):71. doi: 10.1186/s12859-021-03996-x.
Specialized data structures are required for online algorithms to efficiently handle large sequencing datasets. The counting quotient filter (CQF), a compact hashtable, can efficiently store k-mers with a skewed distribution.
Here, we present the mixed-counters quotient filter (MQF) as a new variant of the CQF with novel counting and labeling systems. The new counting system adapts to a wider range of data distributions for increased space efficiency and is faster than the CQF for insertions and queries in most of the tested scenarios. A buffered version of the MQF can offload storage to disk, trading speed of insertions and queries for a significant memory reduction. The labeling system provides a flexible framework for assigning labels to member items while maintaining good data locality and a concise memory representation. These labels serve as a minimal perfect hash function but are ~ tenfold faster than BBhash, with no need to re-analyze the original data for further insertions or deletions.
The MQF is a flexible and efficient data structure that extends our ability to work with high throughput sequencing data.
在线算法需要专门的数据结构才能有效地处理大型测序数据集。计数商滤波器 (CQF) 是一种紧凑的哈希表,可以有效地存储分布不均匀的 k-mer。
在这里,我们提出了混合计数器商滤波器 (MQF),作为 CQF 的一种新变体,具有新的计数和标记系统。新的计数系统适应更广泛的数据分布,以提高空间效率,并且在大多数测试场景中,插入和查询的速度比 CQF 快。MQF 的缓冲版本可以将存储卸载到磁盘上,以显著减少内存为代价换取插入和查询的速度。标记系统为成员项分配标签提供了一个灵活的框架,同时保持了良好的数据局部性和简洁的内存表示。这些标签用作最小完美哈希函数,但比 BBhash 快约十倍,无需为进一步的插入或删除重新分析原始数据。
MQF 是一种灵活高效的数据结构,扩展了我们处理高通量测序数据的能力。