Suppr超能文献

富尔戈尔:一种用于大规模匹配和颜色查询的快速紧凑的k-mer索引。

Fulgor: A fast and compact -mer index for large-scale matching and color queries.

作者信息

Fan Jason, Singh Noor Pratap, Khan Jamshed, Pibiri Giulio Ermanno, Patro Rob

机构信息

Department of Computer Science, University of Maryland, College Park, MD 20440, USA.

DAIS, Ca' Foscari University of Venice, Venice, Italy.

出版信息

bioRxiv. 2023 May 20:2023.05.09.539895. doi: 10.1101/2023.05.09.539895.

Abstract

The problem of sequence identification or matching - determining the subset of references from a given collection that are likely to contain a query nucleotide sequence - is relevant for many important tasks in Computational Biology, such as metagenomics and pan-genome analysis. Due to the complex nature of such analyses and the large scale of the reference collections a resourceefficient solution to this problem is of utmost importance. The reference collection should therefore be pre-processed into an for fast queries. This poses the threefold challenge of designing an index that is efficient to query, has light memory usage, and scales well to large collections. To solve this problem, we describe how recent advancements in associative, order-preserving, -mer dictionaries can be combined with a compressed inverted index to implement a fast and compact graph data structure. This index takes full advantage of the fact that unitigs in the colored de Bruijn graph are (all -mers in a unitig have the same set of references of origin, or "color"), leveraging the property of its dictionary. In fact, -mers are kept in unitig order by the dictionary, thereby allowing for the encoding of the map from -mers to their inverted lists in as little as 1+(1) bits per unitig. Hence, one inverted list per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for inverted lists, the index achieves very small space. We implement these methods in a tool called Fulgor. Compared to Themisto, the prior state of the art, Fulgor indexes a heterogeneous collection of 30,691 bacterial genomes in 3.8× less space, a collection of 150,000 genomes in approximately 2× less space, is at least twice as fast for color queries, and is 2 - 6× faster to construct.

摘要

序列识别或匹配问题——从给定集合中确定可能包含查询核苷酸序列的参考文献子集——在计算生物学的许多重要任务中都很重要,例如宏基因组学和泛基因组分析。由于此类分析的复杂性以及参考文献集合的规模巨大,因此找到一种资源高效的解决方案至关重要。因此,应将参考文献集合预处理成一种索引以便进行快速查询。这带来了三重挑战,即设计一种查询效率高、内存使用少且能很好地扩展到大型集合的索引。为了解决这个问题,我们描述了如何将关联的、保序的k - mer字典的最新进展与压缩倒排索引相结合,以实现一种快速且紧凑的彩色图数据结构。该索引充分利用了彩色德布鲁因图中的单元重叠群是单色的这一事实(一个单元重叠群中的所有k - mer都具有相同的起源参考文献集,即“颜色”),利用了其字典的单色属性。实际上,字典按单元重叠群顺序保存k - mer,从而允许将从k - mer到其倒排列表的映射编码为每个单元重叠群仅1 + o(√n) 位。因此,每个单元重叠群的一个倒排列表几乎没有空间/时间开销地存储在索引中。通过将此属性与简单但有效的倒排列表压缩方法相结合,该索引实现了非常小的空间占用。我们在一个名为Fulgor的工具中实现了这些方法。与之前的技术水平Themisto相比,Fulgor在索引30691个细菌基因组的异构集合时占用空间减少3.8倍,在索引150000个病毒基因组的集合时占用空间减少约2倍,在进行颜色查询时速度至少快两倍,并且构建速度快2 - 6倍。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f891/10201636/d5ceca8f2700/nihpp-2023.05.09.539895v2-f0001.jpg

文献AI研究员

20分钟写一篇综述,助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型,支持多种主流文档格式。

立即体验