Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218-2683, USA.
Genome Res. 2023 Jul;33(7):1218-1227. doi: 10.1101/gr.277655.123. Epub 2023 Jul 6.
A genomic sketch is a small, probabilistic representation of the set of k-mers in a sequencing data set. Sketches are building blocks for large-scale analyses that consider similarities between many pairs of sequences or sequence collections. Although existing tools can easily compare tens of thousands of genomes, data sets can reach millions of sequences and beyond. Popular tools also fail to consider k-mer multiplicities, making them less applicable in quantitative settings. Here, we describe a method called Dashing 2 that builds on the SetSketch data structure. SetSketch is related to HyperLogLog (HLL) but replaces the leading-zero count with a truncated logarithm of adjustable base. Unlike HLL, SetSketch can perform multiplicity-aware sketching when combined with the ProbMinHash method. Dashing 2 integrates locality-sensitive hashing to scale all-pairs comparisons to millions of sequences. Compared with the original Dashing, it achieves superior similarity estimates for the Jaccard coefficient and average nucleotide identity, in much less time and with sketches of the same size. Dashing 2 is free, open-source software.
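As a rough illustration of the quantity that sketch-based comparison estimates, the following Python sketch computes the exact Jaccard coefficient between two k-mer sets and approximates it from small bottom-k MinHash sketches. This is a simpler, multiplicity-unaware stand-in, not the SetSketch or ProbMinHash procedure used by Dashing 2. The function names, the choice of BLAKE2b as the hash, and the example sequences are all assumptions made for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def exact_jaccard(a, b):
    """Exact Jaccard coefficient |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def hash64(kmer):
    """Map a k-mer to a 64-bit integer (illustrative hash choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_k_sketch(kmer_set, size):
    """Keep only the `size` smallest hash values of the set."""
    return sorted(hash64(x) for x in kmer_set)[:size]


def sketch_jaccard(s1, s2, size):
    """Estimate the Jaccard coefficient from two bottom-k sketches.

    The `size` smallest hashes of the merged sketches form a sample of
    the union; the fraction of that sample present in both sketches
    estimates the Jaccard coefficient.
    """
    merged = sorted(set(s1) | set(s2))[:size]
    if not merged:
        return 1.0
    shared = set(s1) & set(s2)
    return sum(1 for h in merged if h in shared) / len(merged)


if __name__ == "__main__":
    # Hypothetical toy sequences; real inputs would be whole genomes.
    a = kmers("ACGTACGTGACCTTAGGCAT", 4)
    b = kmers("ACGTACGTGATTTTAGGCAT", 4)
    size = 8
    print("exact:   ", exact_jaccard(a, b))
    print("sketched:", sketch_jaccard(bottom_k_sketch(a, size),
                                      bottom_k_sketch(b, size), size))
```

With a sketch at least as large as the union of the two k-mer sets, the estimate coincides with the exact value (barring hash collisions); smaller sketches trade accuracy for space, which is the trade-off that sketching methods tune.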