Compact and evenly distributed k-mer binning for genomic sequences.

Authors

Nyström-Persson Johan, Keeble-Gagnère Gabriel, Zawad Niamat

Affiliations

JNP Solutions, Yokoami, Sumida-ku, Tokyo 130-0015, Japan.

Department of R&D, Lifematics Inc., Kanda Jinbocho, Chiyoda-ku, Tokyo 101-0051, Japan.

Publication information

Bioinformatics. 2021 Sep 9;37(17):2563-2569. doi: 10.1093/bioinformatics/btab156.

Abstract

MOTIVATION

The processing of k-mers (subsequences of length k) is at the foundation of many sequence processing algorithms in bioinformatics, including k-mer counting for genome size estimation, genome assembly, and taxonomic classification for metagenomics. Minimizers (ordered m-mers where m < k) are often used to group k-mers into bins as a first step in such processing. However, minimizers are known to generate bins of very different sizes, which can pose challenges for distributed and parallel processing, as well as generally increase memory requirements. Furthermore, although various minimizer orderings have been proposed, their practical value for improving tool efficiency has not yet been fully explored.
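
As a rough illustration of minimizer binning as described above, the following Python sketch assigns each k-mer to the bin of its smallest m-mer. It is not taken from Discount: the function names, the toy sequence, and the choice of plain lexicographic ordering are assumptions made for illustration, and reverse complements (canonical k-mers) are ignored for brevity.

```python
from collections import defaultdict


def minimizer(kmer, m):
    """Return the lexicographically smallest m-mer inside kmer (assumed ordering)."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def bin_kmers(seq, k, m):
    """Group every k-mer of seq into a bin keyed by its minimizer."""
    bins = defaultdict(list)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        bins[minimizer(kmer, m)].append(kmer)
    return dict(bins)


# Toy input, chosen only for illustration.
bins = bin_kmers("ACGTACGTTGCA", k=5, m=2)
for mmer, kmers in sorted(bins.items()):
    print(mmer, len(kmers), kmers)
```

On this toy sequence, five of the eight 5-mers fall into the bin for `AC` while the other three bins hold one k-mer each, a small-scale instance of the uneven bin sizes the abstract mentions.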

RESULTS

We present Discount, a distributed k-mer counting tool based on Apache Spark, which we use to investigate the behaviour of various minimizer orderings in practice when applied to metagenomics data. Using this tool, we then introduce the universal frequency ordering, a new combination of frequency-sampled minimizers and universal k-mer hitting sets, which yields both evenly distributed binning and small bin sizes. We show that this ordering allows Discount to perform distributed k-mer counting on a large dataset in as little as 1/8 of the memory of comparable approaches, making it the most efficient out-of-core distributed k-mer counting method available.
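
The sketch below shows one plausible reading of how a frequency-based ordering restricted to a universal hitting set could be expressed; it is an assumption-laden illustration, not Discount's implementation. The helper names (`mmers`, `make_ordering`, `bin_sizes`) and the `allowed` parameter are hypothetical. The universal set is taken as an input rather than computed, and the example passes the set of all 2-mers, which hits every k-mer trivially.

```python
from collections import Counter
from itertools import product


def mmers(seq, m):
    """List all m-mers of seq in order of occurrence."""
    return [seq[i:i + m] for i in range(len(seq) - m + 1)]


def make_ordering(sample, m, allowed):
    """Build a sort key: m-mers in `allowed` come first, rarest in the sample first.

    `allowed` stands in for a universal hitting set (assumed to be supplied
    by the caller); ties are broken lexicographically.
    """
    counts = Counter(mmers(sample, m))
    return lambda mm: (mm not in allowed, counts[mm], mm)


def bin_sizes(seq, k, m, key):
    """Count how many k-mers of seq land in each minimizer bin under `key`."""
    sizes = Counter()
    for i in range(len(seq) - k + 1):
        sizes[min(mmers(seq[i:i + k], m), key=key)] += 1
    return sizes


seq = "ACGTACGTTGCA"  # toy input, also used as the frequency sample
all_2mers = {"".join(p) for p in product("ACGT", repeat=2)}  # trivial universal set
freq_sizes = bin_sizes(seq, 5, 2, make_ordering(seq, 2, all_2mers))
lex_sizes = bin_sizes(seq, 5, 2, key=lambda mm: mm)
print("frequency ordering:", dict(freq_sizes))
print("lexicographic ordering:", dict(lex_sizes))
```

Preferring rare m-mers keeps common m-mers from attracting most k-mers. A toy sequence this short cannot reproduce the memory savings reported above; it only shows the mechanism.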

AVAILABILITY AND IMPLEMENTATION

Discount is GPL licensed and available at https://github.com/jtnystrom/discount. The data underlying this article are available in the article and in its online supplementary material.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e1f1/8428581/a32ba29b5d9f/btab156f1.jpg
