Genomic sketching with multiplicities and locality-sensitive hashing using Dashing 2.

Affiliations

Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218-2683, USA.

Publication information

Genome Res. 2023 Jul;33(7):1218-1227. doi: 10.1101/gr.277655.123. Epub 2023 Jul 6.

Abstract

A genomic sketch is a small, probabilistic representation of the set of k-mers in a sequencing data set. Sketches are building blocks for large-scale analyses that consider similarities between many pairs of sequences or sequence collections. Although existing tools can easily compare tens of thousands of genomes, data sets can reach millions of sequences and beyond. Popular tools also fail to consider k-mer multiplicities, making them less applicable in quantitative settings. Here, we describe a method called Dashing 2 that builds on the SetSketch data structure. SetSketch is related to HyperLogLog (HLL) but replaces the leading-zero count with a truncated logarithm of adjustable base. Unlike HLL, SetSketch can perform multiplicity-aware sketching when combined with the ProbMinHash method. Dashing 2 integrates locality-sensitive hashing to scale all-pairs comparisons to millions of sequences. It achieves superior similarity estimates for the Jaccard coefficient and average nucleotide identity compared with the original Dashing, in much less time and with the same-sized sketch. Dashing 2 is free, open-source software.
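
To make the idea of sketching a k-mer set concrete, the following is a minimal illustration of estimating the Jaccard coefficient from fixed-size sketches. It uses plain MinHash with one seeded hash per register, which is a simpler, related technique and not the SetSketch structure or the Dashing 2 implementation. All function names (`kmers`, `hash64`, `minhash_sketch`, `estimate_jaccard`, `exact_jaccard`) and the parameter choices are assumptions made for this example, and it ignores k-mer multiplicities and canonical (strand-independent) k-mers.

```python
import hashlib
import random


def kmers(seq, k):
    """Return the set of all length-k substrings of seq (multiplicities ignored)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer, seed):
    """Deterministic 64-bit hash of a k-mer; the seed selects the hash function."""
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(kmer_set, num_hashes):
    """Plain MinHash: register i holds the minimum of hash function i over the set."""
    return [min(hash64(x, seed) for x in kmer_set) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of matching registers, an unbiased estimate of the Jaccard coefficient."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard coefficient |A ∩ B| / |A ∪ B|, for comparison."""
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    # Hypothetical toy data: two random sequences sharing their first half.
    rng = random.Random(0)
    first = "".join(rng.choice("ACGT") for _ in range(400))
    second = first[:200] + "".join(rng.choice("ACGT") for _ in range(200))

    set_a, set_b = kmers(first, 11), kmers(second, 11)
    sk_a, sk_b = minhash_sketch(set_a, 256), minhash_sketch(set_b, 256)

    print("exact Jaccard:    ", exact_jaccard(set_a, set_b))
    print("estimated Jaccard:", estimate_jaccard(sk_a, sk_b))
```

Each sketch here has a fixed number of registers regardless of how many k-mers the input contains, which is what makes pairwise comparison across large collections cheap. This toy version stores a full 64-bit minimum per register; as the abstract notes, HLL-style and SetSketch registers store a much smaller value derived from the hash instead.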
