BLight: efficient exact associative structure for k-mers.

Authors

Marchet Camille, Kerbiriou Mael, Limasset Antoine

Affiliation

University of Lille, CRIStAL CNRS, UMR 9189 - F-59000 Lille, France.

Publication information

Bioinformatics. 2021 Sep 29;37(18):2858-2865. doi: 10.1093/bioinformatics/btab217.

DOI:10.1093/bioinformatics/btab217

PMID:33821954

Abstract

MOTIVATION

A plethora of methods and applications in high-throughput sequence analysis share the fundamental need to associate information with words. Doing so for billions of k-mers is commonly a scalability problem, as exact associative indexes can be memory-expensive. Recent works take advantage of overlaps between k-mers to address this challenge. Yet existing data structures are either unable to associate information with k-mers or are not lightweight enough.

RESULTS

We present BLight, a static and exact data structure that associates unique identifiers with k-mers and determines their membership in a set without false positives, and that scales to huge k-mer sets at a low memory cost. The index combines an extremely compact representation with very fast queries. Its construction is also efficient and needs no additional memory. Our implementation indexes the k-mers of the human genome using 8 GB of RAM (23 bits per k-mer) within 10 min, and the k-mers of the large axolotl genome using 63 GB of memory (27 bits per k-mer) within 76 min. While being memory-efficient, the index provides very high throughput: 1.4 million queries per second on a single CPU, or 16.1 million using 12 cores. Finally, we show how BLight can represent metagenomic and transcriptomic sequencing data in practice, highlighting its wide range of applications.
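
To make the interface described above concrete, here is a minimal Python sketch of an exact associative k-mer index: every distinct k-mer receives a unique identifier, and a query for an absent k-mer reports absence rather than a false positive. It is an illustration only, not BLight's implementation or API. The names `iter_kmers`, `ToyKmerIndex` and `lookup` are hypothetical, and the sketch uses a plain hash map, so it has none of the compactness reported in the abstract. It also ignores reverse complements for simplicity.

```python
from typing import Dict, Iterable, Iterator


def iter_kmers(sequence: str, k: int) -> Iterator[str]:
    """Yield every length-k substring of `sequence`, left to right."""
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k]


class ToyKmerIndex:
    """Exact k-mer -> identifier map, built once (hypothetical, not BLight)."""

    def __init__(self, sequences: Iterable[str], k: int) -> None:
        self.k = k
        self._ids: Dict[str, int] = {}
        for sequence in sequences:
            for kmer in iter_kmers(sequence, k):
                # A k-mer gets the next free identifier the first time it is
                # seen; later occurrences leave the mapping unchanged.
                self._ids.setdefault(kmer, len(self._ids))

    def __len__(self) -> int:
        """Number of distinct k-mers in the index."""
        return len(self._ids)

    def lookup(self, kmer: str) -> int:
        """Return the identifier of `kmer`, or -1 if it is not in the set."""
        return self._ids.get(kmer, -1)
```

Because the identifiers are dense integers in the range 0 to `len(index) - 1`, a caller could keep any per-k-mer payload (for example, counts) in an ordinary array indexed by `index.lookup(kmer)`, which is one way to read "associate information" in the text above.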

AVAILABILITY AND IMPLEMENTATION

We wrote the BLight index as an open-source C++ library, available under the AGPL3 license at github.com/Malfoy/BLight. It is designed as a user-friendly library and comes with code usage samples.

Similar articles

Squeakr: an exact and approximate k-mer counting system.

Bioinformatics. 2018 Feb 15;34(4):568-575. doi: 10.1093/bioinformatics/btx636.

Sparse and skew hashing of K-mers.

Bioinformatics. 2022 Jun 24;38(Suppl 1):i185-i194. doi: 10.1093/bioinformatics/btac245.

A general near-exact k-mer counting method with low memory consumption enables de novo assembly of 106× human sequence data in 2.7 hours.

Bioinformatics. 2020 Dec 30;36(Suppl_2):i625-i633. doi: 10.1093/bioinformatics/btaa890.

Kmerind: A Flexible Parallel Library for K-mer Indexing of Biological Sequences on Distributed Memory Systems.

IEEE/ACM Trans Comput Biol Bioinform. 2019 Jul-Aug;16(4):1117-1131. doi: 10.1109/TCBB.2017.2760829. Epub 2017 Oct 9.

A space and time-efficient index for the compacted colored de Bruijn graph.

Bioinformatics. 2018 Jul 1;34(13):i169-i177. doi: 10.1093/bioinformatics/bty292.

Fast Approximation of Frequent k-Mers and Applications to Metagenomics.

J Comput Biol. 2020 Apr;27(4):534-549. doi: 10.1089/cmb.2019.0314. Epub 2019 Dec 20.

Turtle: identifying frequent k-mers with cache-efficient algorithms.

Bioinformatics. 2014 Jul 15;30(14):1950-7. doi: 10.1093/bioinformatics/btu132. Epub 2014 Mar 10.

These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure.

PLoS One. 2014 Jul 25;9(7):e101271. doi: 10.1371/journal.pone.0101271. eCollection 2014.

DSK: k-mer counting with very low memory usage.

Bioinformatics. 2013 Mar 1;29(5):652-3. doi: 10.1093/bioinformatics/btt020. Epub 2013 Jan 16.

Cited by

K2R: Tinted de Bruijn graphs implementation for efficient read extraction from sequencing datasets.

Bioinform Adv. 2025 May 14;5(1):vbaf111. doi: 10.1093/bioadv/vbaf111. eCollection 2025.

The open-closed mod-minimizer algorithm.

Algorithms Mol Biol. 2025 Mar 17;20(1):4. doi: 10.1186/s13015-025-00270-0.

GRAMEP: an alignment-free method based on the maximum entropy principle for identifying SNPs.

BMC Bioinformatics. 2025 Feb 25;26(1):66. doi: 10.1186/s12859-025-06037-z.

Fractional hitting sets for efficient multiset sketching.

Algorithms Mol Biol. 2025 Feb 8;20(1):1. doi: 10.1186/s13015-024-00268-0.

When less is more: sketching with minimizers in genomics.

Genome Biol. 2024 Oct 14;25(1):270. doi: 10.1186/s13059-024-03414-4.

Creating and Using Minimizer Sketches in Computational Genomics.

J Comput Biol. 2023 Dec;30(12):1251-1276. doi: 10.1089/cmb.2023.0094. Epub 2023 Aug 30.

Efficient minimizer orders for large values of k using minimum decycling sets.

Genome Res. 2023 Jul;33(7):1154-1161. doi: 10.1101/gr.277644.123. Epub 2023 Aug 9.

Scalable sequence database search using partitioned aggregated Bloom comb trees.

Bioinformatics. 2023 Jun 30;39(Suppl 1):i252-i259. doi: 10.1093/bioinformatics/btad225.

Locality-preserving minimal perfect hashing of k-mers.

Bioinformatics. 2023 Jun 30;39(Suppl 1):i534-i543. doi: 10.1093/bioinformatics/btad219.

Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2.

Genome Biol. 2022 Sep 8;23(1):190. doi: 10.1186/s13059-022-02743-6.
