
BLight: efficient exact associative structure for k-mers.

Author information

Marchet Camille, Kerbiriou Mael, Limasset Antoine

Affiliation

University of Lille, CRIStAL CNRS, UMR 9189 - F-59000 Lille, France.

Publication information

Bioinformatics. 2021 Sep 29;37(18):2858-2865. doi: 10.1093/bioinformatics/btab217.

Abstract

MOTIVATION

A plethora of methods and applications in high-throughput sequence analysis share the fundamental need to associate information with words. Doing so for billions of k-mers is commonly a scalability problem, as exact associative indexes can be memory expensive. Recent works take advantage of overlaps between k-mers to address this challenge. Yet existing data structures are either unable to associate information with k-mers or are not lightweight enough.

RESULTS

We present BLight, a static and exact data structure that associates unique identifiers with k-mers and determines their membership in a set without false positives, and that scales to huge k-mer sets at a low memory cost. The index combines an extremely compact representation with very fast queries. Its construction is also efficient and needs no additional memory. Our implementation indexes the k-mers of the human genome using 8 GB of RAM (23 bits per k-mer) within 10 min, and the k-mers of the large axolotl genome using 63 GB of memory (27 bits per k-mer) within 76 min. Furthermore, while being memory efficient, the index provides very high throughput: 1.4 million queries per second on a single CPU, or 16.1 million using 12 cores. Finally, we show how BLight can represent metagenomic and transcriptomic sequencing data in practice, highlighting its wide range of applications.
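To make the notion of a static, exact associative structure concrete, the following is a minimal C++ sketch of the contract described above: every indexed k-mer gets a unique identifier, and a query for an absent k-mer reports absence without false positives. It is not BLight's algorithm or API. The class name KmerIndex and its members are hypothetical, the sketch stores each k-mer in a plain sorted array of 64-bit words, and it assumes k is at most 32 and ignores reverse complements.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Illustrative only: a sorted-array index, NOT BLight's layout or API.
// All names (KmerIndex, identifier, encode) are hypothetical.
class KmerIndex {
public:
    // Static construction: collect all k-mers once, then sort and deduplicate.
    KmerIndex(const std::vector<std::string>& sequences, std::size_t k) : k_(k) {
        assert(k >= 1 && k <= 32);  // 2 bits per base must fit in 64 bits
        for (const std::string& seq : sequences) {
            for (std::size_t i = 0; i + k <= seq.size(); ++i) {
                // Windows containing a non-ACGT character are skipped.
                if (auto code = encode(seq.substr(i, k))) kmers_.push_back(*code);
            }
        }
        std::sort(kmers_.begin(), kmers_.end());
        kmers_.erase(std::unique(kmers_.begin(), kmers_.end()), kmers_.end());
    }

    // Exact query: the rank of the k-mer in [0, size()) if it was indexed,
    // std::nullopt otherwise. The stored key is compared, so no false positives.
    std::optional<std::size_t> identifier(const std::string& kmer) const {
        if (kmer.size() != k_) return std::nullopt;
        auto code = encode(kmer);
        if (!code) return std::nullopt;
        auto it = std::lower_bound(kmers_.begin(), kmers_.end(), *code);
        if (it == kmers_.end() || *it != *code) return std::nullopt;
        return static_cast<std::size_t>(it - kmers_.begin());
    }

    std::size_t size() const { return kmers_.size(); }

private:
    // Packs A/C/G/T into 2 bits each; any other character yields std::nullopt.
    static std::optional<std::uint64_t> encode(const std::string& kmer) {
        std::uint64_t code = 0;
        for (char base : kmer) {
            std::uint64_t bits = 0;
            switch (base) {
                case 'A': bits = 0; break;
                case 'C': bits = 1; break;
                case 'G': bits = 2; break;
                case 'T': bits = 3; break;
                default: return std::nullopt;
            }
            code = (code << 2) | bits;
        }
        return code;
    }

    std::size_t k_;
    std::vector<std::uint64_t> kmers_;
};
```

The returned identifier can serve as an offset into a separate array of user values, which is one way to associate information with k-mers. This sketch spends 64 bits per k-mer and needs a binary search per query, so it mainly illustrates the interface; the 23 to 27 bits per k-mer reported above require a far more compact representation than a sorted array.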

AVAILABILITY AND IMPLEMENTATION

We wrote the BLight index as an open-source C++ library under the AGPL3 license, available at github.com/Malfoy/BLight. It is designed as a user-friendly library and comes with code usage samples.
