Suppr超能文献

Fulgor:一种用于大规模匹配和颜色查询的快速紧凑的k-mer索引。

Fulgor: a fast and compact k-mer index for large-scale matching and color queries.

作者信息

Fan Jason, Khan Jamshed, Singh Noor Pratap, Pibiri Giulio Ermanno, Patro Rob

机构信息

Department of Computer Science, University of Maryland, College Park, MD, 20742, USA.

DAIS, Ca' Foscari University of Venice, Venice, Italy.

出版信息

Algorithms Mol Biol. 2024 Jan 22;19(1):3. doi: 10.1186/s13015-024-00251-9.

Abstract

The problem of sequence identification or matching-determining the subset of reference sequences from a given collection that are likely to contain a short, queried nucleotide sequence-is relevant for many important tasks in Computational Biology, such as metagenomics and pangenome analysis. Due to the complex nature of such analyses and the large scale of the reference collections a resource-efficient solution to this problem is of utmost importance. This poses the threefold challenge of representing the reference collection with a data structure that is efficient to query, has light memory usage, and scales well to large collections. To solve this problem, we describe an efficient colored de Bruijn graph index, arising as the combination of a k-mer dictionary with a compressed inverted index. The proposed index takes full advantage of the fact that unitigs in the colored compacted de Bruijn graph are monochromatic (i.e., all k-mers in a unitig have the same set of references of origin, or color). Specifically, the unitigs are kept in the dictionary in color order, thereby allowing for the encoding of the map from k-mers to their colors in as little as 1 + o(1) bits per unitig. Hence, one color per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for integer lists, the index achieves very small space. We implement these methods in a tool called Fulgor, and conduct an extensive experimental analysis to demonstrate the improvement of our tool over previous solutions. For example, compared to Themisto-the strongest competitor in terms of index space vs. query time trade-off-Fulgor requires significantly less space (up to 43% less space for a collection of 150,000 Salmonella enterica genomes), is at least twice as fast for color queries, and is 2-6[Formula: see text] faster to construct.

摘要

序列识别或匹配问题——即从给定集合中确定可能包含短查询核苷酸序列的参考序列子集——在计算生物学的许多重要任务中都很关键,比如宏基因组学和泛基因组分析。由于此类分析的复杂性以及参考集合的规模巨大,针对这个问题的资源高效解决方案至关重要。这带来了三重挑战:用一种高效查询、内存占用小且能很好地扩展到大型集合的数据结构来表示参考集合。为了解决这个问题,我们描述了一种高效的带色德布鲁因图索引,它是由k - 元字典与压缩倒排索引组合而成。所提出的索引充分利用了带色压缩德布鲁因图中的单元重叠群是单色的这一事实(即一个单元重叠群中的所有k - 元都具有相同的起源参考集,或颜色)。具体而言,单元重叠群按颜色顺序保存在字典中,从而允许以每个单元重叠群仅1 + o(1) 比特的方式对从k - 元到其颜色的映射进行编码。因此,每个单元重叠群的一种颜色几乎不占用空间/时间开销地存储在索引中。通过将此属性与针对整数列表的简单但有效的压缩方法相结合,该索引实现了非常小的空间占用。我们在一个名为Fulgor的工具中实现了这些方法,并进行了广泛的实验分析,以证明我们的工具相对于先前解决方案的改进。例如,与在索引空间与查询时间权衡方面最强的竞争对手Themisto相比,Fulgor所需空间显著更少(对于150,000个肠炎沙门氏菌基因组的集合,空间减少多达43%),颜色查询速度至少快两倍,构建速度快2 - 6[公式:见原文] 倍。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b590/10810250/914ac878e491/13015_2024_251_Fig1_HTML.jpg

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验