BLEND: a fast, memory-efficient and accurate mechanism to find fuzzy seed matches in genome analysis.

Author information

Firtina Can, Park Jisung, Alser Mohammed, Kim Jeremie S, Cali Damla Senol, Shahroodi Taha, Ghiasi Nika Mansouri, Singh Gagandeep, Kanellopoulos Konstantinos, Alkan Can, Mutlu Onur

Affiliations

ETH Zurich, Zurich 8092, Switzerland.

POSTECH, Pohang 37673, Republic of Korea.

Publication information

NAR Genom Bioinform. 2023 Jan 20;5(1):lqad004. doi: 10.1093/nargab/lqad004. eCollection 2023 Mar.

DOI:10.1093/nargab/lqad004

PMID:36685727

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9853099/

Abstract

Generating the hash values of short subsequences, called seeds, enables quickly identifying similarities between genomic sequences by matching seeds with a single lookup of their hash values. However, these hash values can be used only for finding exact-matching seeds, as conventional hashing methods assign distinct hash values to different seeds, including highly similar seeds. Finding only exact-matching seeds causes either (i) increased use of costly sequence alignment or (ii) limited sensitivity. We introduce BLEND, the first efficient and accurate mechanism that can identify both exact-matching and highly similar seeds, called fuzzy seed matches, with a single lookup of their hash values. BLEND (i) utilizes a technique called SimHash, which can generate the same hash value for similar sets, and (ii) provides the proper mechanisms for using seeds as sets with the SimHash technique to find fuzzy seed matches efficiently. We show the benefits of BLEND when used in read overlapping and read mapping. For read overlapping, BLEND is faster by 2.4×-83.9× (on average 19.3×), has a lower memory footprint by 0.9×-14.1× (on average 3.8×), and finds higher quality overlaps leading to more accurate assemblies than the state-of-the-art tool, minimap2. For read mapping, BLEND is faster by 0.8×-4.1× (on average 1.7×) than minimap2. Source code is available at https://github.com/CMU-SAFARI/BLEND.
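To make the SimHash idea in the abstract more concrete, the following is a minimal Python sketch of a generic SimHash computed over the k-mers of a seed. It is not BLEND's implementation: the helper names (`kmers`, `item_hash`, `simhash`, `hamming`), the 32-bit hash width, the k-mer length, and the use of BLAKE2b as the per-item hash are all assumptions chosen for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return all overlapping k-mers of seq (hypothetical helper)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def item_hash(item, bits=32):
    """Hash one item to a `bits`-wide integer (BLAKE2b is an assumed choice)."""
    digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def simhash(items, bits=32):
    """Generic SimHash: each item votes +1/-1 on every bit position,
    and the final hash keeps the bits whose vote total is positive."""
    counters = [0] * bits
    for item in items:
        h = item_hash(item, bits)
        for b in range(bits):
            counters[b] += 1 if (h >> b) & 1 else -1
    value = 0
    for b in range(bits):
        if counters[b] > 0:
            value |= 1 << b
    return value


def hamming(a, b):
    """Number of differing bits between two hash values."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    # Two seeds that differ by a single base share most of their k-mers.
    seed_a = "ACGTACGGTTCAGGCTAAGC"
    seed_b = "ACGTACGGTTCAGGCTAAGG"
    ha = simhash(kmers(seed_a, 8))
    hb = simhash(kmers(seed_b, 8))
    print(f"{ha:08x} {hb:08x} differing bits: {hamming(ha, hb)}")
```

Because each output bit is decided by a majority vote over all items, changing a small fraction of the items tends to flip few or no bits, which is what allows similar sets to collide on the same hash value. How seeds are turned into sets, and which hash width is used, are design decisions of BLEND that this sketch does not attempt to reproduce.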

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bb30/9853099/2a235f84fda1/lqad004fig1.jpg

Similar articles

ntHash2: recursive spaced seed hashing for nucleotide sequences.

Bioinformatics. 2022 Oct 14;38(20):4812-4813. doi: 10.1093/bioinformatics/btac564.

RawHash2: mapping raw nanopore signals using hash-based seeding and adaptive quantization.

Bioinformatics. 2024 Aug 2;40(8). doi: 10.1093/bioinformatics/btae478.

Perfect Hamming code with a hash table for faster genome mapping.

BMC Genomics. 2011 Nov 30;12 Suppl 3(Suppl 3):S8. doi: 10.1186/1471-2164-12-S3-S8.

RawHash: enabling fast and accurate real-time analysis of raw nanopore signals for large genomes.

Bioinformatics. 2023 Jun 30;39(39 Suppl 1):i297-i307. doi: 10.1093/bioinformatics/btad272.

Global, highly specific and fast filtering of alignment seeds.

BMC Bioinformatics. 2022 Jun 10;23(1):225. doi: 10.1186/s12859-022-04745-4.

SneakySnake: a fast and accurate universal genome pre-alignment filter for CPUs, GPUs and FPGAs.

Bioinformatics. 2021 Apr 1;36(22-23):5282-5290. doi: 10.1093/bioinformatics/btaa1015.

Improving hash-q exact string matching algorithm with perfect hashing for DNA sequences.

Comput Biol Med. 2021 Apr;131:104292. doi: 10.1016/j.compbiomed.2021.104292. Epub 2021 Feb 22.

FSH: fast spaced seed hashing exploiting adjacent hashes.

Algorithms Mol Biol. 2018 Mar 22;13:8. doi: 10.1186/s13015-018-0125-4. eCollection 2018.

rHAT: fast alignment of noisy long reads with regional hashing.

Bioinformatics. 2016 Jun 1;32(11):1625-31. doi: 10.1093/bioinformatics/btv662. Epub 2015 Nov 14.

Cited by

xRead: a coverage-guided approach for scalable construction of read overlapping graph.

Gigascience. 2025 Jan 6;14. doi: 10.1093/gigascience/giaf007.

Taming large-scale genomic analyses via sparsified genomics.

Nat Commun. 2025 Jan 21;16(1):876. doi: 10.1038/s41467-024-55762-1.

dna2bit: high performance genomic distance estimation software for microbial genome analysis.

Front Microbiol. 2024 Dec 23;15:1521181. doi: 10.3389/fmicb.2024.1521181. eCollection 2024.

TargetCall: eliminating the wasted computation in basecalling via pre-basecalling filtering.

Front Genet. 2024 Oct 28;15:1429306. doi: 10.3389/fgene.2024.1429306. eCollection 2024.

Advancing microbial diagnostics: a universal phylogeny guided computational algorithm to find unique sequences for precise microorganism detection.

Brief Bioinform. 2024 Sep 23;25(6). doi: 10.1093/bib/bbae545.

Improved sub-genomic RNA prediction with the ARTIC protocol.

Nucleic Acids Res. 2024 Sep 23;52(17):e82. doi: 10.1093/nar/gkae687.

RawHash2: mapping raw nanopore signals using hash-based seeding and adaptive quantization.

Bioinformatics. 2024 Aug 2;40(8). doi: 10.1093/bioinformatics/btae478.

HyperGen: Compact and Efficient Genome Sketching using Hyperdimensional Vectors.

Bioinformatics. 2024 Jul 16;40(7). doi: 10.1093/bioinformatics/btae452.

Enhancing insights into diseases through horizontal gene transfer event detection from gut microbiome.

Nucleic Acids Res. 2024 Aug 12;52(14):e61. doi: 10.1093/nar/gkae515.

Designing efficient randstrobes for sequence similarity analyses.

Bioinformatics. 2024 Mar 29;40(4). doi: 10.1093/bioinformatics/btae187.
