Compact and evenly distributed k-mer binning for genomic sequences.

Author information

Nyström-Persson Johan, Keeble-Gagnère Gabriel, Zawad Niamat

Affiliations

JNP Solutions, Yokoami, Sumida-ku, Tokyo 130-0015, Japan.

Department of R&D, Lifematics Inc., Kanda Jinbocho, Chiyoda-ku, Tokyo 101-0051, Japan.

Publication information

Bioinformatics. 2021 Sep 9;37(17):2563-2569. doi: 10.1093/bioinformatics/btab156.

DOI:10.1093/bioinformatics/btab156

PMID:33693556

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8428581/

Abstract

MOTIVATION

The processing of k-mers (subsequences of length k) is at the foundation of many sequence processing algorithms in bioinformatics, including k-mer counting for genome size estimation, genome assembly, and taxonomic classification for metagenomics. Minimizers (ordered m-mers, where m < k) are often used to group k-mers into bins as a first step in such processing. However, minimizers are known to generate bins of very different sizes, which can pose challenges for distributed and parallel processing and generally increases memory requirements. Furthermore, although various minimizer orderings have been proposed, their practical value for improving tool efficiency has not yet been fully explored.
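To make the binning step concrete, the following is a minimal sketch of minimizer-based binning, not code from Discount. It assumes the simplest possible ordering (lexicographic) and ignores reverse complements; the function names `kmers`, `minimizer` and `bin_kmers` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every length-k substring of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def minimizer(kmer, m, rank=None):
    """Return the smallest m-mer inside kmer under the given ordering.

    rank maps an m-mer to a sortable key; the default (identity)
    gives plain lexicographic ordering, an assumption of this sketch.
    """
    key = rank if rank is not None else (lambda mmer: mmer)
    return min((kmer[i:i + m] for i in range(len(kmer) - m + 1)), key=key)


def bin_kmers(seq, k, m, rank=None):
    """Group the k-mers of seq into bins keyed by their minimizer."""
    bins = defaultdict(list)
    for km in kmers(seq, k):
        bins[minimizer(km, m, rank)].append(km)
    return dict(bins)


bins = bin_kmers("ACGTACGTTT", k=5, m=2)
for mmer, members in bins.items():
    print(mmer, len(members), members)
```

On this toy sequence, five of the six k-mers fall into the bin for `AC` and one into the bin for `CG`, a small-scale instance of the uneven bin sizes described above.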

RESULTS

We present Discount, a distributed k-mer counting tool based on Apache Spark, which we use to investigate the behaviour of various minimizer orderings in practice when applied to metagenomics data. Using this tool, we then introduce the universal frequency ordering, a new combination of frequency-sampled minimizers and universal k-mer hitting sets, which yields both evenly distributed binning and small bin sizes. We show that this ordering allows Discount to perform distributed k-mer counting on a large dataset in as little as 1/8 of the memory of comparable approaches, making it the most efficient out-of-core distributed k-mer counting method available.
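One way to picture such a combined ordering is sketched below. This is an interpretation of the abstract under stated assumptions, not the ordering as implemented in Discount: m-mers belonging to a given set are ranked ahead of all others, and within each group rarer m-mers (as counted in a sample of reads) come first, with ties broken lexicographically. The names `count_mmers` and `universal_frequency_rank` are hypothetical, and `toy_set` is an arbitrary stand-in rather than a real universal hitting set.

```python
from collections import Counter


def count_mmers(sample_reads, m):
    """Count every m-mer occurring in a sample of reads."""
    counts = Counter()
    for read in sample_reads:
        for i in range(len(read) - m + 1):
            counts[read[i:i + m]] += 1
    return counts


def universal_frequency_rank(counts, hitting_set):
    """Build a sort key: set members first, then rarest first.

    Assumed combination rule for this sketch only; m-mers outside the
    set are kept as a fallback so that a minimizer is always defined.
    """
    def rank(mmer):
        return (mmer not in hitting_set, counts[mmer], mmer)
    return rank


def minimizer(kmer, m, rank):
    """Return the lowest-ranked m-mer inside kmer."""
    return min((kmer[i:i + m] for i in range(len(kmer) - m + 1)), key=rank)


sample = ["ACGTACGTTT", "ACACAC"]   # toy sample of reads
toy_set = {"AC", "TA", "GT"}        # hypothetical, NOT a verified hitting set
counts = count_mmers(sample, 2)
rank = universal_frequency_rank(counts, toy_set)

print(sorted(counts, key=rank))     # m-mers from highest to lowest priority
print(minimizer("ACGTT", 2, rank))  # minimizer chosen for one 5-mer
```

In this toy sample `AC` is the most frequent m-mer, so it drops to the lowest priority among the set members and `TA` and `GT` are chosen as minimizers instead. In practice the set would come from a precomputed universal hitting set for the chosen k and m, and the counts from a sample of the input data.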

AVAILABILITY AND IMPLEMENTATION

Discount is GPL licensed and available at https://github.com/jtnystrom/discount. The data underlying this article are available in the article and in its online supplementary material.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e1f1/8428581/a32ba29b5d9f/btab156f1.jpg
