Genomic sketching with multiplicities and locality-sensitive hashing using Dashing 2.

Author information

Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218-2683, USA.

Publication information

Genome Res. 2023 Jul;33(7):1218-1227. doi: 10.1101/gr.277655.123. Epub 2023 Jul 6.

Abstract

A genomic sketch is a small, probabilistic representation of the set of k-mers in a sequencing data set. Sketches are building blocks for large-scale analyses that consider similarities between many pairs of sequences or sequence collections. Although existing tools can easily compare tens of thousands of genomes, data sets can reach millions of sequences and beyond. Popular tools also fail to consider k-mer multiplicities, making them less applicable in quantitative settings. Here, we describe a method called Dashing 2 that builds on the SetSketch data structure. SetSketch is related to HyperLogLog (HLL) but replaces the leading-zero count with a truncated logarithm of adjustable base. Unlike HLL, SetSketch can perform multiplicity-aware sketching when combined with the ProbMinHash method. Dashing 2 integrates locality-sensitive hashing to scale all-pairs comparisons to millions of sequences. It achieves superior similarity estimates for the Jaccard coefficient and average nucleotide identity compared with the original Dashing, and does so in much less time while using a sketch of the same size. Dashing 2 is free, open-source software.
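
To make the idea of a genomic sketch concrete, the following is a minimal illustrative sketch in Python, not Dashing 2's implementation. It substitutes plain MinHash for SetSketch, ignores k-mer multiplicities, and omits locality-sensitive hashing. The function names (`kmers`, `hash64`, `minhash_sketch`, `estimate_jaccard`, `exact_jaccard`) and the parameter values are assumptions chosen for illustration only.

```python
import hashlib
import random


def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer, seed):
    """Deterministic 64-bit hash of a k-mer under a given seed."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(kmer_set, num_registers=128):
    """Plain MinHash (an assumption, not SetSketch): one minimum per seed."""
    if not kmer_set:
        raise ValueError("cannot sketch an empty k-mer set")
    return [min(hash64(km, seed) for km in kmer_set)
            for seed in range(num_registers)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of matching registers estimates the Jaccard coefficient."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same number of registers")
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard coefficient, for comparison with the estimate."""
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical toy data: a random sequence and a lightly mutated copy.
    seq_a = "".join(rng.choice("ACGT") for _ in range(2000))
    seq_b = "".join(rng.choice("ACGT") if rng.random() < 0.01 else base
                    for base in seq_a)
    set_a, set_b = kmers(seq_a, 15), kmers(seq_b, 15)
    est = estimate_jaccard(minhash_sketch(set_a), minhash_sketch(set_b))
    print(f"exact={exact_jaccard(set_a, set_b):.3f} estimated={est:.3f}")
```

Each sketch here has a fixed number of registers regardless of genome size, which is what makes comparing many pairs cheap. The estimate's standard error shrinks roughly with the square root of the number of registers.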
