Sequence Factorization with Multiple References.

Authors

Wandelt Sebastian, Leser Ulf

Affiliation

Knowledge Management in Bioinformatics, Humboldt-University of Berlin, Rudower Chaussee 25, 12489 Berlin, Germany.

Publication

PLoS One. 2015 Sep 30;10(9):e0139000. doi: 10.1371/journal.pone.0139000. eCollection 2015.

Abstract

The success of high-throughput sequencing has led to an increasing number of projects that sequence large populations of a species. Storage and analysis of sequence data are key challenges in these projects because of the sheer size of the datasets. Compression is one simple technology for dealing with this challenge. Referential factorization and compression schemes, which store only the differences between an input sequence and a reference sequence, have gained considerable interest in this field. Highly similar sequences, e.g., human genomes, can be compressed with a compression ratio of 1,000:1 and more, up to two orders of magnitude better than with standard compression techniques. Recently, it was shown that compression against multiple references from the same species can boost the compression ratio up to 4,000:1. However, a detailed analysis of using multiple references is lacking, e.g., for main memory consumption and optimality. In this paper, we describe one key technique for referential compression against multiple references: the factorization of sequences. Based on the notion of an optimal factorization, we propose optimization heuristics and identify parameter settings which greatly influence 1) the size of the factorization, 2) the time for factorization, and 3) the required amount of main memory. We evaluate a total of 30 setups with a varying number of references on data from three different species. Our results show a wide range of factorization sizes (optimal to an overhead of up to 300%), factorization speeds (0.01 MB/s to more than 600 MB/s), and main memory usage (a few dozen MB to dozens of GB). Based on our evaluation, we identify the best configurations for common use cases. Our evaluation shows that multi-reference factorization is much better than single-reference factorization.
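To make the idea of factorizing a sequence against several references concrete, the following is a minimal sketch, not the authors' method. It assumes a naive greedy strategy (take the longest match found in any reference at each position, otherwise emit a literal character); the abstract does not say that this is how the paper builds its factorizations, and a greedy result is not claimed to be optimal. The names `longest_match`, `factorize`, `reconstruct`, and the `min_len` parameter are hypothetical, and the toy sequences are made up for illustration.

```python
from typing import List, Tuple, Union

# A factor is either a reference match (ref_index, start, length)
# or a single literal character that was not matched.
Factor = Union[Tuple[int, int, int], str]


def longest_match(seq: str, i: int, ref: str) -> Tuple[int, int]:
    """Return (start, length) of the longest prefix of seq[i:] found in ref.

    Naive quadratic scan, chosen for clarity only (assumption: a real
    tool would use an index structure instead).
    """
    best_start, best_len = -1, 0
    for s in range(len(ref)):
        length = 0
        while (i + length < len(seq) and s + length < len(ref)
               and seq[i + length] == ref[s + length]):
            length += 1
        if length > best_len:
            best_start, best_len = s, length
    return best_start, best_len


def factorize(seq: str, refs: List[str], min_len: int = 2) -> List[Factor]:
    """Greedy left-to-right factorization against one or more references."""
    factors: List[Factor] = []
    i = 0
    while i < len(seq):
        best = (-1, -1, 0)  # (ref_index, start, length)
        for r, ref in enumerate(refs):
            start, length = longest_match(seq, i, ref)
            if length > best[2]:
                best = (r, start, length)
        if best[2] >= min_len:
            factors.append(best)      # store a pointer into a reference
            i += best[2]
        else:
            factors.append(seq[i])    # too short to be worth a pointer
            i += 1
    return factors


def reconstruct(factors: List[Factor], refs: List[str]) -> str:
    """Rebuild the original sequence from its factors."""
    parts = []
    for f in factors:
        if isinstance(f, str):
            parts.append(f)
        else:
            r, start, length = f
            parts.append(refs[r][start:start + length])
    return "".join(parts)


# Toy data (hypothetical): two short references and one input sequence.
refs = ["ACGTACGT", "TTGACCA"]
seq = "ACGTTTGACCAG"

multi = factorize(seq, refs)        # both references available
single = factorize(seq, refs[:1])   # only the first reference
```

On this toy input, the second reference lets a long stretch of the input be stored as a single pointer, so `multi` contains fewer factors than `single`, while `reconstruct(multi, refs)` still returns the original sequence. This only illustrates why additional references can shrink a factorization; the sizes, speeds, and memory figures in the abstract come from the paper's own evaluation, not from this sketch.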


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fff0/4589410/911a67045b10/pone.0139000.g001.jpg
