Department of Information Engineering, University of Padova, 35131 Padova, Italy.
Bioinformatics. 2022 Jun 27;38(13):3343-3350. doi: 10.1093/bioinformatics/btac180.
The extraction of k-mers is a fundamental component in many complex analyses of large next-generation sequencing datasets, including read classification in genomics and the characterization of RNA-seq datasets. Extracting all k-mers and their frequencies is extremely demanding in terms of running time and memory, owing to the size of the data and to the exponential number of k-mers to be considered. However, several applications require only the frequent k-mers, that is, the k-mers that appear in a relatively high proportion of the data.
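As an illustration of the exact task described above, the following minimal sketch counts every k-mer in a set of reads and keeps those whose frequency reaches a threshold. The function names, the threshold parameter `theta`, and the definition of frequency as a k-mer's count divided by the total number of k-mer occurrences in the data are assumptions made for this example; they are not taken from the paper.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) over all reads."""
    counts = Counter()
    for read in reads:
        # A read of length n contains n - k + 1 k-mers (none if n < k).
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def frequent_kmers(reads, k, theta):
    """Return {k-mer: frequency} for k-mers with frequency >= theta.

    Frequency is assumed here to be count / total k-mer occurrences.
    """
    counts = count_kmers(reads, k)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: c / total for kmer, c in counts.items() if c / total >= theta}


if __name__ == "__main__":
    toy_reads = ["ACGTAC", "CGTACG", "TTTTTT"]  # hypothetical toy data
    print(frequent_kmers(toy_reads, k=3, theta=0.2))
```

This exact approach touches every position of every read and stores a counter for every distinct k-mer, which is the running-time and memory cost that motivates approximation.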
In this work, we present SPRISS, a new efficient algorithm to approximate frequent k-mers and their frequencies in next-generation sequencing data. SPRISS uses a simple yet powerful read sampling scheme that extracts a representative subset of the dataset. This subset can be used, in combination with any k-mer counting algorithm, to perform downstream analyses in a fraction of the time required to analyze the whole dataset, while obtaining comparable answers. Our extensive experimental evaluation demonstrates the efficiency and accuracy of SPRISS in approximating frequent k-mers, and shows that it can be used in various scenarios, such as the comparison of metagenomic datasets, the identification of discriminative k-mers, and SNP (single nucleotide polymorphism) genotyping.
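The general idea of sampling reads and then running a k-mer counter on the sample can be sketched as follows. This is not the SPRISS implementation: the abstract does not specify the sampling scheme or how the sample size is chosen, so the sketch simply assumes uniform sampling of whole reads with replacement, and the names `sample_reads`, `approx_frequent_kmers`, `m`, `theta`, and `slack` are hypothetical.

```python
import random
from collections import Counter


def count_kmers(reads, k):
    """Stand-in for any k-mer counting algorithm."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def sample_reads(reads, m, seed=0):
    """Draw m reads uniformly at random, with replacement (an assumption)."""
    rng = random.Random(seed)
    return rng.choices(reads, k=m)


def approx_frequent_kmers(reads, k, theta, m, slack=0.0, seed=0):
    """Estimate frequent k-mers from a sample of m reads.

    A k-mer is reported if its frequency in the sample is at least
    theta - slack; slack is a hypothetical margin that lowers the
    threshold to reduce the chance of missing truly frequent k-mers.
    """
    sample = sample_reads(reads, m, seed)
    counts = count_kmers(sample, k)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        kmer: c / total
        for kmer, c in counts.items()
        if c / total >= theta - slack
    }


if __name__ == "__main__":
    # Hypothetical toy dataset: one dominant read and one rare read.
    toy_reads = ["AAAAAA"] * 90 + ["ACGTAC"] * 10
    print(approx_frequent_kmers(toy_reads, k=3, theta=0.5, m=50))
```

Only the sampled reads are passed to the counter, so the counting cost scales with the sample size `m` rather than with the size of the full dataset. How large `m` must be to guarantee a given accuracy is exactly what a rigorous method has to establish, and this sketch makes no such guarantee.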
SPRISS [a preliminary version (Santoro et al., 2021) of this work was presented at RECOMB 2021] is available at https://github.com/VandinLab/SPRISS.
Supplementary data are available at Bioinformatics online.