Compressing genomic sequence fragments using SlimGene.

Author information

Kozanitis Christos, Saunders Chris, Kruglyak Semyon, Bafna Vineet, Varghese George

Affiliations

Department of Computer Science and Engineering, University of California, San Diego, California, USA.

Publication information

J Comput Biol. 2011 Mar;18(3):401-13. doi: 10.1089/cmb.2010.0253.

Abstract

With the advent of next-generation sequencing technologies, the cost of sequencing whole genomes is poised to fall below $1000 per human individual within a few years. As more genomes are sequenced, analysis methods are developing rapidly, making it tempting to store sequencing data for long periods so that the data can be re-analyzed with the latest techniques. Challenging open research problems, a huge influx of data, and rapidly improving analysis techniques have created the need to store and transfer very large volumes of data. Compression can be achieved at many levels, including the trace level (compressing image data), the sequence level (compressing a genomic sequence), and the fragment level (compressing a set of short, redundant fragment reads, along with quality values on the base calls). We focus on fragment-level compression, which is the pressing need today. Our article makes two contributions, implemented in a tool, SlimGene. First, we introduce a set of domain-specific lossless compression schemes that achieve over 40× compression of fragments, outperforming bzip2 by over 6×. Including quality values, we show a 5× compression using less running time than bzip2. Second, given the discrepancy between the compression factors obtained with and without quality values, we initiate the study of using "lossy" quality values. Specifically, we show that a lossy quality value quantization results in 14× compression but has minimal impact on downstream applications, such as SNP calling, that use the quality values. Discrepancies between SNP calls made from the lossy and lossless versions of the data are limited to low-coverage areas where even the SNP calls made from the lossless version are marginal.
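The idea behind lossy quality value quantization can be illustrated with a minimal sketch. The abstract does not specify SlimGene's quantization scheme, so everything below is an assumption made for illustration: the uniform bin width `step`, the helper names `quantize`, `to_ascii`, and `compressed_size`, the Phred+33 ASCII encoding, the use of bzip2 as the back-end compressor, and the synthetic, uniformly random quality values. The sizes it prints say nothing about the 14× figure reported above; the sketch only shows why collapsing quality values into fewer distinct symbols makes them easier to compress.

```python
import bz2
import random


def quantize(quals, step=8):
    """Lossy step: map each quality value to the lower bound of its bin.

    `step` is a hypothetical bin width, not a SlimGene parameter.
    The error introduced per value is at most step - 1.
    """
    return [(q // step) * step for q in quals]


def to_ascii(quals, offset=33):
    """Encode quality values as Phred+33 bytes (assumed encoding)."""
    return bytes(q + offset for q in quals)


def compressed_size(quals):
    """Size in bytes after generic bzip2 compression of the encoded values."""
    return len(bz2.compress(to_ascii(quals)))


# Synthetic quality values in [0, 40]; real reads are far less random.
rng = random.Random(0)
quals = [rng.randint(0, 40) for _ in range(20000)]
lossy = quantize(quals)

raw_size = compressed_size(quals)
lossy_size = compressed_size(lossy)
print("lossless bytes:", raw_size)
print("lossy bytes:   ", lossy_size)
```

A coarser `step` shrinks the output further at the cost of a larger per-value error, which is the trade-off that would need to be checked against downstream tools such as a SNP caller.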
