Comparing fixed sampling with minimizer sampling when using k-mer indexes to find maximal exact matches.

Authors

Almutairy Meznah, Torng Eric

Affiliations

Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America.

Department of Computer Science, College of Computer and Information Sciences, Imam Muhammad ibn Saud Islamic University, Riyadh, Saudi Arabia.

Publication information

PLoS One. 2018 Feb 1;13(2):e0189960. doi: 10.1371/journal.pone.0189960. eCollection 2018.

DOI:10.1371/journal.pone.0189960

PMID:29389989

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5794061/

Abstract

Bioinformatics applications and pipelines increasingly use k-mer indexes to search for similar sequences. The major problem with k-mer indexes is that they require lots of memory. Sampling is often used to reduce index size and query time. Most applications use one of two major types of sampling: fixed sampling and minimizer sampling. It is well known that fixed sampling will produce a smaller index, typically by roughly a factor of two, whereas it is generally assumed that minimizer sampling will produce faster query times since query k-mers can also be sampled. However, no direct comparison of fixed and minimizer sampling has been performed to verify these assumptions. We systematically compare fixed and minimizer sampling using the human genome as our database. We use the resulting k-mer indexes for fixed sampling and minimizer sampling to find all maximal exact matches between our database, the human genome, and three separate query sets, the mouse genome, the chimp genome, and an NGS data set. We reach the following conclusions. First, using larger k-mers reduces query time for both fixed sampling and minimizer sampling at a cost of requiring more space. If we use the same k-mer size for both methods, fixed sampling requires typically half as much space whereas minimizer sampling processes queries only slightly faster. If we are allowed to use any k-mer size for each method, then we can choose a k-mer size such that fixed sampling both uses less space and processes queries faster than minimizer sampling. The reason is that although minimizer sampling is able to sample query k-mers, the number of shared k-mer occurrences that must be processed is much larger for minimizer sampling than fixed sampling. In conclusion, we argue that for any application where each shared k-mer occurrence must be processed, fixed sampling is the right sampling method.
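To make the two sampling schemes concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names `fixed_sample` and `minimizer_sample`, the parameter `w` (sampling step for fixed sampling, window size in k-mers for minimizer sampling), and the choice of lexicographic order with leftmost tie-breaking for minimizers are all assumptions introduced here. On a random sequence, the sketch shows minimizer sampling keeping noticeably more k-mers than fixed sampling for the same `k` and `w`, consistent with the abstract's statement that fixed sampling yields the smaller index.

```python
import random


def fixed_sample(seq, k, w):
    """Fixed sampling: keep the k-mer starting at every w-th position."""
    return [(i, seq[i:i + k]) for i in range(0, len(seq) - k + 1, w)]


def minimizer_sample(seq, k, w):
    """Minimizer sampling: in every window of w consecutive k-mers,
    keep the smallest k-mer (lexicographic order, leftmost on ties).
    Both choices are assumptions made for this sketch."""
    n_kmers = len(seq) - k + 1
    picked = set()
    for start in range(n_kmers - w + 1):
        # Compare (k-mer, position) pairs so ties resolve to the leftmost k-mer.
        best = min(range(start, start + w), key=lambda i: (seq[i:i + k], i))
        picked.add(best)
    return [(i, seq[i:i + k]) for i in sorted(picked)]


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the run is repeatable
    seq = "".join(rng.choice("ACGT") for _ in range(20000))
    k, w = 12, 8
    fixed = fixed_sample(seq, k, w)
    minimizers = minimizer_sample(seq, k, w)
    print("total k-mers:      ", len(seq) - k + 1)
    print("fixed sampling:    ", len(fixed))
    print("minimizer sampling:", len(minimizers))
```

One consequence of the design: with fixed sampling only the database is sampled, so every query k-mer still has to be looked up, whereas the minimizer rule depends only on window content and can be applied to the query as well. The sketch does not model query processing or the number of shared k-mer occurrences, which is the quantity the abstract identifies as decisive.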

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0801/5794061/d278494107a1/pone.0189960.g001.jpg
