Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, Lille, F-59000, France.
Bioinformatics. 2024 Jun 28;40(Suppl 1):i48-i57. doi: 10.1093/bioinformatics/btae217.
In this article, we introduce the Conway-Bromage-Lyndon (CBL) structure, a compressed, dynamic and exact method for representing k-mer sets. Originating from Conway and Bromage's concept, CBL innovatively employs the smallest cyclic rotations of k-mers, akin to Lyndon words, to leverage lexicographic redundancies. In order to support dynamic operations and set operations, we propose a dynamic bit vector structure that draws a parallel with Elias-Fano's scheme. This structure is encapsulated in a Rust library, demonstrating a balanced blend of construction efficiency, cache locality, and compression. Our findings suggest that CBL outperforms existing dynamic k-mer set methods. Unique to this work, CBL stands out as the only known exact k-mer structure offering in-place set operations. Its different combined abilities position it as a flexible Swiss knife structure for k-mer set management.
本文介绍了 Conway-Bromage-Lyndon(CBL)结构,这是一种用于表示 k-mer 集的压缩、动态且精确的方法。CBL 源自 Conway 和 Bromage 的概念,创新性地利用了最小的 k-mer 循环旋转,类似于 Lyndon 单词,以利用词法冗余。为了支持动态操作和集合操作,我们提出了一种动态位向量结构,该结构与 Elias-Fano 方案并行。该结构被封装在一个 Rust 库中,展示了构建效率、缓存局部性和压缩之间的平衡融合。我们的研究结果表明,CBL 优于现有的动态 k-mer 集方法。CBL 的独特之处在于它是唯一已知的提供原地集合操作的精确 k-mer 结构。其不同的组合能力使其成为 k-mer 集管理的灵活瑞士军刀结构。