SPRISS: approximating frequent k-mers by sampling reads, and applications.

Affiliation

Department of Information Engineering, University of Padova, 35131 Padova, Italy.

Publication information

Bioinformatics. 2022 Jun 27;38(13):3343-3350. doi: 10.1093/bioinformatics/btac180.

Abstract

MOTIVATION

The extraction of k-mers is a fundamental component in many complex analyses of large next-generation sequencing datasets, including read classification in genomics and the characterization of RNA-seq datasets. Extracting all k-mers and their frequencies is extremely demanding in terms of running time and memory, owing to the size of the data and to the exponential number of k-mers to be considered. However, several applications require only frequent k-mers, that is, k-mers appearing in a relatively high proportion of the data.

RESULTS

In this work, we present SPRISS, a new efficient algorithm to approximate frequent k-mers and their frequencies in next-generation sequencing data. SPRISS uses a simple yet powerful read sampling scheme that extracts a representative subset of the dataset. This subset can be used, in combination with any k-mer counting algorithm, to perform downstream analyses in a fraction of the time required to analyze the whole dataset, while obtaining comparable answers. Our extensive experimental evaluation demonstrates the efficiency and accuracy of SPRISS in approximating frequent k-mers, and shows that it can be used in various scenarios, such as the comparison of metagenomic datasets, the identification of discriminative k-mers, and SNP (single nucleotide polymorphism) genotyping.
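The general idea of sampling reads and then running an ordinary k-mer counter on the sample can be illustrated with a minimal Python sketch. This is not the SPRISS implementation: the function names, the uniform sampling without replacement, the `sample_fraction` parameter, and the frequency definition (occurrences divided by the total number of k-mer positions) are all assumptions made for illustration, and the sketch carries none of the approximation guarantees of the actual algorithm.

```python
import random
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence in a list of reads (exact counting)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_frequencies(reads, k):
    """Return each k-mer's share of all k-mer positions in `reads`.

    Assumed definition: occurrences / total number of k-mer positions.
    """
    counts = count_kmers(reads, k)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: c / total for kmer, c in counts.items()}


def approx_frequent_kmers(reads, k, theta, sample_fraction, seed=0):
    """Estimate k-mers with frequency >= theta from a random sample of reads.

    Hypothetical helper: reads are drawn uniformly without replacement,
    which is a simplification and not necessarily the SPRISS scheme.
    """
    rng = random.Random(seed)
    sample_size = max(1, round(sample_fraction * len(reads)))
    sample = rng.sample(reads, min(sample_size, len(reads)))
    # Any k-mer counter could be substituted for this step.
    freqs = kmer_frequencies(sample, k)
    return {kmer: f for kmer, f in freqs.items() if f >= theta}


if __name__ == "__main__":
    toy_reads = ["ACGTACGT", "ACGTTTTT", "GGGGACGT"]
    print(approx_frequent_kmers(toy_reads, k=4, theta=0.2, sample_fraction=0.5))
```

With `sample_fraction=1.0` the sketch reduces to exact counting on the whole dataset. Smaller fractions trade accuracy for time, and how small the sample can be while keeping the estimates reliable is the question the paper itself addresses.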

AVAILABILITY AND IMPLEMENTATION

SPRISS is available at https://github.com/VandinLab/SPRISS. A preliminary version of this work (Santoro et al., 2021) was presented at RECOMB 2021.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3e25/9237683/d9ce6c932c3b/btac180f1.jpg
