富尔戈尔：一种用于大规模匹配和颜色查询的快速紧凑的k-mer索引。

Fulgor: A fast and compact -mer index for large-scale matching and color queries.

作者信息

Fan Jason, Singh Noor Pratap, Khan Jamshed, Pibiri Giulio Ermanno, Patro Rob

机构信息

Department of Computer Science, University of Maryland, College Park, MD 20440, USA.

DAIS, Ca' Foscari University of Venice, Venice, Italy.

出版信息

bioRxiv. 2023 May 20:2023.05.09.539895. doi: 10.1101/2023.05.09.539895.

DOI:10.1101/2023.05.09.539895

PMID:37214944

原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC10197524/

Abstract

The problem of sequence identification or matching - determining the subset of references from a given collection that are likely to contain a query nucleotide sequence - is relevant for many important tasks in Computational Biology, such as metagenomics and pan-genome analysis. Due to the complex nature of such analyses and the large scale of the reference collections a resourceefficient solution to this problem is of utmost importance. The reference collection should therefore be pre-processed into an for fast queries. This poses the threefold challenge of designing an index that is efficient to query, has light memory usage, and scales well to large collections. To solve this problem, we describe how recent advancements in associative, order-preserving, -mer dictionaries can be combined with a compressed inverted index to implement a fast and compact graph data structure. This index takes full advantage of the fact that unitigs in the colored de Bruijn graph are (all -mers in a unitig have the same set of references of origin, or "color"), leveraging the property of its dictionary. In fact, -mers are kept in unitig order by the dictionary, thereby allowing for the encoding of the map from -mers to their inverted lists in as little as 1+(1) bits per unitig. Hence, one inverted list per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for inverted lists, the index achieves very small space. We implement these methods in a tool called Fulgor. Compared to Themisto, the prior state of the art, Fulgor indexes a heterogeneous collection of 30,691 bacterial genomes in 3.8× less space, a collection of 150,000 genomes in approximately 2× less space, is at least twice as fast for color queries, and is 2 - 6× faster to construct.

摘要

序列识别或匹配问题——从给定集合中确定可能包含查询核苷酸序列的参考文献子集——在计算生物学的许多重要任务中都很重要，例如宏基因组学和泛基因组分析。由于此类分析的复杂性以及参考文献集合的规模巨大，因此找到一种资源高效的解决方案至关重要。因此，应将参考文献集合预处理成一种索引以便进行快速查询。这带来了三重挑战，即设计一种查询效率高、内存使用少且能很好地扩展到大型集合的索引。为了解决这个问题，我们描述了如何将关联的、保序的k - mer字典的最新进展与压缩倒排索引相结合，以实现一种快速且紧凑的彩色图数据结构。该索引充分利用了彩色德布鲁因图中的单元重叠群是单色的这一事实（一个单元重叠群中的所有k - mer都具有相同的起源参考文献集，即“颜色”），利用了其字典的单色属性。实际上，字典按单元重叠群顺序保存k - mer，从而允许将从k - mer到其倒排列表的映射编码为每个单元重叠群仅1 + o(√n) 位。因此，每个单元重叠群的一个倒排列表几乎没有空间/时间开销地存储在索引中。通过将此属性与简单但有效的倒排列表压缩方法相结合，该索引实现了非常小的空间占用。我们在一个名为Fulgor的工具中实现了这些方法。与之前的技术水平Themisto相比，Fulgor在索引30691个细菌基因组的异构集合时占用空间减少3.8倍，在索引150000个病毒基因组的集合时占用空间减少约2倍，在进行颜色查询时速度至少快两倍，并且构建速度快2 - 6倍。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f891/10201636/d5ceca8f2700/nihpp-2023.05.09.539895v2-f0001.jpg

相似文献

Fulgor: A fast and compact -mer index for large-scale matching and color queries.富尔戈尔：一种用于大规模匹配和颜色查询的快速紧凑的k-mer索引。

bioRxiv. 2023 May 20:2023.05.09.539895. doi: 10.1101/2023.05.09.539895.

Fulgor: a fast and compact k-mer index for large-scale matching and color queries.Fulgor：一种用于大规模匹配和颜色查询的快速紧凑的k-mer索引。

Algorithms Mol Biol. 2024 Jan 22;19(1):3. doi: 10.1186/s13015-024-00251-9.

Where the Patterns Are: Repetition-Aware Compression for Colored de Bruijn Graphs.模式所在：带重复感知的彩色 de Bruijn 图压缩。

J Comput Biol. 2024 Oct;31(10):1022-1044. doi: 10.1089/cmb.2024.0714. Epub 2024 Oct 9.

Where the patterns are: repetition-aware compression for colored de Bruijn graphs .其模式为：彩色德布鲁因图的重复感知压缩。

bioRxiv. 2024 Jul 13:2024.07.09.602727. doi: 10.1101/2024.07.09.602727.

Compression Algorithm for Colored de Bruijn Graphs.彩色德布鲁因图的压缩算法

Lebniz Int Proc Inform. 2023 Sep;273. doi: 10.4230/LIPIcs.WABI.2023.17. Epub 2023 Aug 29.

A space and time-efficient index for the compacted colored de Bruijn graph.一种用于压缩彩色 de Bruijn 图的空间和时间高效索引。

Bioinformatics. 2018 Jul 1;34(13):i169-i177. doi: 10.1093/bioinformatics/bty292.

Meta-colored compacted de Bruijn graphs.元彩色压缩德布鲁因图

bioRxiv. 2023 Nov 1:2023.07.21.550101. doi: 10.1101/2023.07.21.550101.

Compression algorithm for colored de Bruijn graphs.彩色德布鲁因图的压缩算法。

Algorithms Mol Biol. 2024 May 26;19(1):20. doi: 10.1186/s13015-024-00254-6.

Representation of -Mer Sets Using Spectrum-Preserving String Sets.使用谱保持串集表示 -Mer 集。

J Comput Biol. 2021 Apr;28(4):381-394. doi: 10.1089/cmb.2020.0431. Epub 2020 Dec 7.

Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes. Themisto：一种可扩展的彩色 k-mer 索引，可用于对数十万细菌基因组进行敏感的伪比对。

Bioinformatics. 2023 Jun 30;39(39 Suppl 1):i260-i269. doi: 10.1093/bioinformatics/btad233.

本文引用的文献

Bioinformatics. 2023 Jun 30;39(39 Suppl 1):i260-i269. doi: 10.1093/bioinformatics/btad233.

Extremely fast construction and querying of compacted and colored de Bruijn graphs with GGCAT.使用 GGCAT 实现紧凑且着色的 de Bruijn 图的快速构建和查询。

Genome Res. 2023 Jul;33(7):1198-1207. doi: 10.1101/gr.277615.122. Epub 2023 May 30.

KMCP: accurate metagenomic profiling of both prokaryotic and viral populations by pseudo-mapping.KMCP：通过伪映射对原核生物和病毒种群进行准确的宏基因组分析。

Bioinformatics. 2023 Jan 1;39(1). doi: 10.1093/bioinformatics/btac845.

Sparse and skew hashing of K-mers.K- -mer 的稀疏和偏斜哈希。

Bioinformatics. 2022 Jun 24;38(Suppl 1):i185-i194. doi: 10.1093/bioinformatics/btac245.

The complete sequence of a human genome.人类基因组的完整序列。

Science. 2022 Apr;376(6588):44-53. doi: 10.1126/science.abj6987. Epub 2022 Mar 31.

Alevin-fry unlocks rapid, accurate and memory-frugal quantification of single-cell RNA-seq data.Alevin-fry实现了对单细胞RNA测序数据的快速、准确且节省内存的定量分析。

Nat Methods. 2022 Mar;19(3):316-322. doi: 10.1038/s41592-022-01408-3. Epub 2022 Mar 11.

Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences.通过对存档DNA序列的精心整理和可搜索快照探索细菌多样性。

PLoS Biol. 2021 Nov 9;19(11):e3001421. doi: 10.1371/journal.pbio.3001421. eCollection 2021 Nov.

High-resolution sweep metagenomics using fast probabilistic inference.使用快速概率推理的高分辨率扫描宏基因组学。

Wellcome Open Res. 2021 Oct 8;5:14. doi: 10.12688/wellcomeopenres.15639.2. eCollection 2020.

HumGut: a comprehensive human gut prokaryotic genomes collection filtered by metagenome data.HumGut：基于宏基因组数据过滤的综合人类肠道原核基因组集。

Microbiome. 2021 Jul 31;9(1):165. doi: 10.1186/s40168-021-01114-w.

Cuttlefish: fast, parallel and low-memory compaction of de Bruijn graphs from large-scale genome collections.乌贼算法：从大规模基因组集合中快速、并行且低内存消耗的 de Bruijn 图压缩。

Bioinformatics. 2021 Jul 12;37(Suppl_1):i177-i186. doi: 10.1093/bioinformatics/btab309.

文献检索

告别复杂PubMed语法，用中文像聊天一样搜索，搜遍4000万医学文献。AI智能推荐，让科研检索更轻松。

立即免费搜索

文件翻译

保留排版，准确专业，支持PDF/Word/PPT等文件格式，支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述，25分钟生成高质量综述，智能提取关键信息，辅助科研写作。

立即免费体验

富尔戈尔：一种用于大规模匹配和颜色查询的快速紧凑的k-mer索引。

Fulgor: A fast and compact -mer index for large-scale matching and color queries.

作者信息

机构信息

出版信息

相似文献

本文引用的文献

文献检索

文件翻译

深度研究

Suppr 超能文献

相似文献

本文引用的文献