Suppr超能文献

通过查询 K -mer 的固定采样和索引 K-mer 的布隆过滤实现最大精确匹配的快速检测。

Fast detection of maximal exact matches via fixed sampling of query K-mers and Bloom filtering of index K-mers.

机构信息

Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, NSW 2007, Australia.

School of Information Technology, Deakin University, VIC 3216, Australia.

出版信息

Bioinformatics. 2019 Nov 1;35(22):4560-4567. doi: 10.1093/bioinformatics/btz273.

Abstract

MOTIVATION

Detection of maximal exact matches (MEMs) between two long sequences is a fundamental problem in pairwise reference-query genome comparisons. To efficiently compare larger and larger genomes, reducing the number of indexed k-mers as well as the number of query k-mers has been adopted as a mainstream approach which saves the computational resources by avoiding a significant number of unnecessary matches.

RESULTS

Under this framework, we proposed a new method to detect all MEMs from a pair of genomes. The method first performs a fixed sampling of k-mers on the query sequence, and adds these selected k-mers to a Bloom filter. Then all the k-mers of the reference sequence are tested by the Bloom filter. If a k-mer passes the test, it is inserted into a hash table for indexing. Compared with the existing methods, much less number of query k-mers are generated and much less k-mers are inserted into the index to avoid unnecessary matches, leading to an efficient matching process and memory usage savings. Experiments on large genomes demonstrate that our method is at least 1.8 times faster than the best of the existing algorithms. This performance is mainly attributed to the key novelty of our method that the fixed k-mer sampling must be conducted on the query sequence and the index k-mers are filtered from the reference sequence via a Bloom filter.

AVAILABILITY AND IMPLEMENTATION

https://github.com/yuansliu/bfMEM.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

摘要

动机

在两个长序列之间检测最大精确匹配(MEMs)是进行成对参考-查询基因组比较的基本问题。为了有效地比较越来越大的基因组,减少索引的 k-mer 数量和查询 k-mer 数量已被采用为主流方法,通过避免大量不必要的匹配来节省计算资源。

结果

在这个框架下,我们提出了一种从一对基因组中检测所有 MEMs 的新方法。该方法首先对查询序列进行固定的 k-mer 采样,并将这些选择的 k-mer 添加到布隆过滤器中。然后,通过布隆过滤器测试参考序列的所有 k-mer。如果 k-mer 通过测试,则将其插入哈希表以供索引。与现有方法相比,生成的查询 k-mer 数量要少得多,插入索引的 k-mer 数量要少得多,以避免不必要的匹配,从而实现高效的匹配过程和节省内存使用。在大型基因组上的实验表明,我们的方法至少比现有算法中的最佳算法快 1.8 倍。这种性能主要归因于我们的方法的关键新颖性,即固定的 k-mer 采样必须在查询序列上进行,并且索引 k-mer 通过布隆过滤器从参考序列中过滤。

可用性和实现

https://github.com/yuansliu/bfMEM。

补充信息

补充数据可在 Bioinformatics 在线获得。

文献AI研究员

20分钟写一篇综述,助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型,支持多种主流文档格式。

立即体验