
Comparing fixed sampling with minimizer sampling when using k-mer indexes to find maximal exact matches.

Authors

Meznah Almutairy, Eric Torng

Affiliations

Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America.

Department of Computer Science, College of Computer and Information Sciences, Imam Muhammad ibn Saud Islamic University, Riyadh, Saudi Arabia.

Publication information

PLoS One. 2018 Feb 1;13(2):e0189960. doi: 10.1371/journal.pone.0189960. eCollection 2018.

DOI:10.1371/journal.pone.0189960
PMID:29389989
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5794061/
Abstract

Bioinformatics applications and pipelines increasingly use k-mer indexes to search for similar sequences. The major problem with k-mer indexes is that they require lots of memory. Sampling is often used to reduce index size and query time. Most applications use one of two major types of sampling: fixed sampling and minimizer sampling. It is well known that fixed sampling will produce a smaller index, typically by roughly a factor of two, whereas it is generally assumed that minimizer sampling will produce faster query times since query k-mers can also be sampled. However, no direct comparison of fixed and minimizer sampling has been performed to verify these assumptions. We systematically compare fixed and minimizer sampling using the human genome as our database. We use the resulting k-mer indexes for fixed sampling and minimizer sampling to find all maximal exact matches between our database, the human genome, and three separate query sets, the mouse genome, the chimp genome, and an NGS data set. We reach the following conclusions. First, using larger k-mers reduces query time for both fixed sampling and minimizer sampling at a cost of requiring more space. If we use the same k-mer size for both methods, fixed sampling requires typically half as much space whereas minimizer sampling processes queries only slightly faster. If we are allowed to use any k-mer size for each method, then we can choose a k-mer size such that fixed sampling both uses less space and processes queries faster than minimizer sampling. The reason is that although minimizer sampling is able to sample query k-mers, the number of shared k-mer occurrences that must be processed is much larger for minimizer sampling than fixed sampling. In conclusion, we argue that for any application where each shared k-mer occurrence must be processed, fixed sampling is the right sampling method.
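
As a rough illustration of the two sampling schemes the abstract compares, the following is a minimal sketch, not the authors' implementation. It assumes lexicographic k-mer ordering with leftmost tie-breaking for minimizers, a window of `w` consecutive k-mers, and a random toy sequence in place of the human genome; the function names `fixed_sample` and `minimizer_sample` are hypothetical.

```python
import random


def kmers(seq, k):
    """Return every (position, k-mer) pair in seq."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def fixed_sample(seq, k, w):
    """Fixed sampling: keep the k-mer that starts at every w-th position."""
    return [(i, seq[i:i + k]) for i in range(0, len(seq) - k + 1, w)]


def minimizer_sample(seq, k, w):
    """Minimizer sampling: from each window of w consecutive k-mers keep
    the smallest one (assumed order: lexicographic, leftmost on ties)."""
    all_kmers = kmers(seq, k)
    chosen = set()
    for start in range(len(all_kmers) - w + 1):
        window = all_kmers[start:start + w]
        chosen.add(min(window, key=lambda pk: (pk[1], pk[0])))
    return sorted(chosen)


# Toy "database": a random DNA string standing in for a genome.
random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(20000))
k, w = 12, 8

fixed = fixed_sample(genome, k, w)
mini = minimizer_sample(genome, k, w)

# Index-size comparison: number of sampled database k-mers per scheme.
print("fixed-sampled k-mers:    ", len(fixed))
print("minimizer-sampled k-mers:", len(mini))
print("size ratio (minimizer / fixed):", round(len(mini) / len(fixed), 2))

# Query side: a substring of the database yields the same minimizers as the
# database does over that region, which is why query k-mers can be sampled.
offset = 1000
query = genome[offset:offset + 100]
query_mini = {(p + offset, s) for p, s in minimizer_sample(query, k, w)}
print("query minimizers all found in index:", query_mini <= set(mini))
```

On random sequence the minimizer index in this sketch comes out somewhat under twice the size of the fixed-sampled index, in line with the "roughly a factor of two" stated in the abstract. With fixed sampling the database is sampled by position, so a query has to look up every one of its k-mers; with minimizer sampling the query can be sampled as well, but, as the abstract notes, the number of shared k-mer occurrences to process ends up larger.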

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0801/5794061/520f9fef2f69/pone.0189960.g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0801/5794061/d278494107a1/pone.0189960.g001.jpg
