基因组k-mer计数表的空间高效表示。

Space-efficient representation of genomic k-mer count tables.

作者信息

Shibuya Yoshihiro, Belazzougui Djamal, Kucherov Gregory

机构信息

LIGM, Université Gustave Eiffel, Marne-la-Vallée, France.

CAPA, DTISI, Centre de Recherche sur l'Information Scientifique et Technique, Algiers, DZ, Algeria.

出版信息

Algorithms Mol Biol. 2022 Mar 21;17(1):5. doi: 10.1186/s13015-022-00212-0.

DOI:10.1186/s13015-022-00212-0

PMID:35317833

原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC8939220/

Abstract

MOTIVATION

k-mer counting is a common task in bioinformatic pipelines, with many dedicated tools available. Many of these tools produce in output k-mer count tables containing both k-mers and counts, easily reaching tens of GB. Furthermore, such tables do not support efficient random-access queries in general.

RESULTS

In this work, we design an efficient representation of k-mer count tables supporting fast random-access queries. We propose to apply Compressed Static Functions (CSFs), with space proportional to the empirical zero-order entropy of the counts. For very skewed distributions, like those of k-mer counts in whole genomes, the only currently available implementation of CSFs does not provide a compact enough representation. By adding a Bloom filter to a CSF we obtain a Bloom-enhanced CSF (BCSF) effectively overcoming this limitation. Furthermore, by combining BCSFs with minimizer-based bucketing of k-mers, we build even smaller representations breaking the empirical entropy lower bound, for large enough k. We also extend these representations to the approximate case, gaining additional space. We experimentally validate these techniques on k-mer count tables of whole genomes (E. Coli and C. Elegans) and unassembled reads, as well as on k-mer document frequency tables for 29 E. Coli genomes. In the case of exact counts, our representation takes about a half of the space of the empirical entropy, for large enough k's.

摘要

动机

k-mer计数是生物信息学流程中的一项常见任务，有许多专用工具可用。这些工具中的许多都会生成包含k-mer和计数的输出k-mer计数值表，其大小很容易达到数十GB。此外，一般来说，这样的表不支持高效的随机访问查询。

结果

在这项工作中，我们设计了一种k-mer计数值表的高效表示方法，支持快速随机访问查询。我们建议应用压缩静态函数（CSF），其空间与计数的经验零阶熵成正比。对于非常偏斜的分布，比如全基因组中的k-mer计数分布，目前唯一可用的CSF实现并不能提供足够紧凑的表示。通过在CSF中添加布隆过滤器，我们得到了布隆增强CSF（BCSF），有效地克服了这一限制。此外，通过将BCSF与基于最小化器的k-mer分桶相结合，对于足够大的k，我们构建了更小的表示，打破了经验熵下限。我们还将这些表示扩展到近似情况，从而获得更多空间。我们在全基因组（大肠杆菌和秀丽隐杆线虫）和未组装读数的k-mer计数值表以及29个大肠杆菌基因组的k-mer文档频率表上对这些技术进行了实验验证。在精确计数的情况下，对于足够大的k，我们的表示占用的空间约为经验熵的一半。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b094/8939220/9e01ca46a99a/13015_2022_212_Fig1_HTML.jpg

相似文献

Space-efficient representation of genomic k-mer count tables.基因组k-mer计数表的空间高效表示。

Algorithms Mol Biol. 2022 Mar 21;17(1):5. doi: 10.1186/s13015-022-00212-0.

Set-Min Sketch: A Probabilistic Map for Power-Law Distributions with Application to -Mer Annotation.集最小草图：用于幂律分布的概率图及其在 -Mer 注释中的应用。

J Comput Biol. 2022 Feb;29(2):140-154. doi: 10.1089/cmb.2021.0429. Epub 2022 Jan 18.

Squeakr: an exact and approximate k-mer counting system.Squeakr：一种精确和近似的 k-mer 计数系统。

Bioinformatics. 2018 Feb 15;34(4):568-575. doi: 10.1093/bioinformatics/btx636.

On weighted k-mer dictionaries.关于加权k-元字典。

Algorithms Mol Biol. 2023 Jun 17;18(1):3. doi: 10.1186/s13015-023-00226-2.

These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure.这些不是你要找的k-mer：使用概率数据结构进行高效在线k-mer计数。

PLoS One. 2014 Jul 25;9(7):e101271. doi: 10.1371/journal.pone.0101271. eCollection 2014.

Space-efficient computation of k-mer dictionaries for large values of k.针对大k值的k-mer字典进行节省空间的计算。

Algorithms Mol Biol. 2024 Apr 5;19(1):14. doi: 10.1186/s13015-024-00259-1.

KCOSS: an ultra-fast k-mer counter for assembled genome analysis.KCOSS：用于组装基因组分析的超快速k-mer计数器。

Bioinformatics. 2022 Jan 27;38(4):933-940. doi: 10.1093/bioinformatics/btab797.

Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters.使用k-mer布隆过滤器提高序列数据上的布隆过滤器性能。

J Comput Biol. 2017 Jun;24(6):547-557. doi: 10.1089/cmb.2016.0155. Epub 2016 Nov 9.

Comparing fixed sampling with minimizer sampling when using k-mer indexes to find maximal exact matches.在使用k-mer索引查找最大精确匹配时，比较固定采样和最小化采样。

PLoS One. 2018 Feb 1;13(2):e0189960. doi: 10.1371/journal.pone.0189960. eCollection 2018.

Representation of -Mer Sets Using Spectrum-Preserving String Sets.使用谱保持串集表示 -Mer 集。

J Comput Biol. 2021 Apr;28(4):381-394. doi: 10.1089/cmb.2020.0431. Epub 2020 Dec 7.

引用本文的文献

Space-efficient computation of k-mer dictionaries for large values of k.针对大k值的k-mer字典进行节省空间的计算。

Algorithms Mol Biol. 2024 Apr 5;19(1):14. doi: 10.1186/s13015-024-00259-1.

Creating and Using Minimizer Sketches in Computational Genomics.在计算基因组学中创建和使用最小草图。

J Comput Biol. 2023 Dec;30(12):1251-1276. doi: 10.1089/cmb.2023.0094. Epub 2023 Aug 30.

Ten quick tips for bioinformatics analyses using an Apache Spark distributed computing environment.使用 Apache Spark 分布式计算环境进行生物信息学分析的十个快速技巧。

PLoS Comput Biol. 2023 Jul 20;19(7):e1011272. doi: 10.1371/journal.pcbi.1011272. eCollection 2023 Jul.

Locality-preserving minimal perfect hashing of k-mers.保局最小完美哈希的 k- -mer。

Bioinformatics. 2023 Jun 30;39(Suppl 1):i534-i543. doi: 10.1093/bioinformatics/btad219.

On weighted k-mer dictionaries.关于加权k-元字典。

Algorithms Mol Biol. 2023 Jun 17;18(1):3. doi: 10.1186/s13015-023-00226-2.

Sparse and skew hashing of K-mers.K- -mer 的稀疏和偏斜哈希。

Bioinformatics. 2022 Jun 24;38(Suppl 1):i185-i194. doi: 10.1093/bioinformatics/btac245.

本文引用的文献

A Randomized Parallel Algorithm for Efficiently Finding Near-Optimal Universal Hitting Sets.一种用于高效找到近似最优通用命中集的随机并行算法。

Res Comput Mol Biol. 2020 May;12074:37-53. doi: 10.1007/978-3-030-45257-5_3. Epub 2020 Apr 21.

MQF and buffered MQF: quotient filters for efficient storage of k-mers with their counts and metadata.MQF 和缓冲 MQF：用于高效存储具有计数和元数据的 k-mer 的商滤波器。

BMC Bioinformatics. 2021 Feb 16;22(1):71. doi: 10.1186/s12859-021-03996-x.

'Multi-SpaM': a maximum-likelihood approach to phylogeny reconstruction using multiple spaced-word matches and quartet trees.“多间隔词匹配法”：一种使用多个间隔词匹配和四重树进行系统发育重建的最大似然法。

NAR Genom Bioinform. 2019 Oct 30;2(1):lqz013. doi: 10.1093/nargab/lqz013. eCollection 2020 Mar.

Nebula: ultra-efficient mapping-free structural variant genotyper.星云：超高效免图结构变异基因分型器。

Nucleic Acids Res. 2021 May 7;49(8):e47. doi: 10.1093/nar/gkab025.

Lower Density Selection Schemes via Small Universal Hitting Sets with Short Remaining Path Length.通过具有短剩余路径长度的小型通用命中集进行低密度选择方案。

J Comput Biol. 2021 Apr;28(4):395-409. doi: 10.1089/cmb.2020.0432. Epub 2020 Dec 15.

REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets.驯鹿：测序数据集中小段序列存在和丰度的高效索引。

Bioinformatics. 2020 Jul 1;36(Suppl_1):i177-i185. doi: 10.1093/bioinformatics/btaa487.

Sparse Binary Relation Representations for Genome Graph Annotation.用于基因组图注释的稀疏二元关系表示

J Comput Biol. 2020 Apr;27(4):626-639. doi: 10.1089/cmb.2019.0324. Epub 2019 Dec 20.

Toward perfect reads: self-correction of short reads via mapping on de Bruijn graphs.迈向完美读段：通过在 De Bruijn 图上进行映射来自我纠正短读段。

Bioinformatics. 2020 Mar 1;36(5):1374-1381. doi: 10.1093/bioinformatics/btz102.

SeqOthello: querying RNA-seq experiments at scale.SeqOthello：大规模查询 RNA-seq 实验。

Genome Biol. 2018 Oct 19;19(1):167. doi: 10.1186/s13059-018-1535-9.

Association mapping from sequencing reads using -mers.基于 -mers 的测序reads 的关联作图。

Elife. 2018 Jun 13;7:e32920. doi: 10.7554/eLife.32920.

文献AI研究员

20分钟写一篇综述，助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型，支持多种主流文档格式。

立即体验

基因组k-mer计数表的空间高效表示。

Space-efficient representation of genomic k-mer count tables.

作者信息

机构信息

出版信息

MOTIVATION

RESULTS

动机

结果

相似文献

引用本文的文献

本文引用的文献

文献AI研究员

用中文搜PubMed

文档翻译

Suppr 超能文献

相似文献

引用本文的文献

本文引用的文献