Data structures and compression algorithms for genomic sequence data.

Authors

Brandon Marty C, Wallace Douglas C, Baldi Pierre

Affiliation

Department of Computer Science, UCI, Irvine, CA 92697, USA.

Publication information

Bioinformatics. 2009 Jul 15;25(14):1731-8. doi: 10.1093/bioinformatics/btp319. Epub 2009 May 15.

Abstract

MOTIVATION

The continuing exponential accumulation of full genome data, including full diploid human genomes, creates new challenges not only for understanding genomic structure, function and evolution, but also for the storage, navigation and privacy of genomic data. Here, we develop data structures and algorithms for the efficient storage of genomic and other sequence data that may also facilitate querying and protecting the data.

RESULTS

The general idea is to encode only the differences between a genome sequence and a reference sequence, using absolute or relative coordinates for the location of the differences. These locations and the corresponding differential variants can be encoded into binary strings using various entropy coding methods, from fixed codes, such as Golomb and Elias codes, to variable codes, such as Huffman codes. We demonstrate the approach and various tradeoffs using highly variable human mitochondrial genome sequences as a testbed. With only a partial level of optimization, 3615 genome sequences occupying 56 MB in GenBank are compressed down to only 167 KB, achieving a 345-fold compression rate, using the revised Cambridge Reference Sequence as the reference sequence. Using the consensus sequence as the reference sequence, the data can be stored using only 133 KB, corresponding to a 433-fold level of compression, roughly a 23% improvement. Extensions to nuclear genomes and high-throughput sequencing data are discussed.
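
The reference-based idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes substitutions only (no insertions or deletions), sequences over A, C, G and T of the same length as the reference, relative coordinates (gaps between successive differences), and an Elias gamma code for each gap followed by a fixed 2-bit code for the substituted base. All function names here are hypothetical.

```python
from typing import List, Tuple

BASES = "ACGT"  # assumption: 2 bits per substituted base, no N and no indels


def diff_against_reference(ref: str, seq: str) -> List[Tuple[int, str]]:
    """Return (0-based absolute position, new base) for each substitution."""
    if len(ref) != len(seq):
        raise ValueError("this sketch handles substitutions only")
    return [(i, b) for i, (a, b) in enumerate(zip(ref, seq)) if a != b]


def elias_gamma(n: int) -> str:
    """Elias gamma code of a positive integer, as a string of '0' and '1'."""
    if n < 1:
        raise ValueError("Elias gamma is defined for n >= 1")
    binary = bin(n)[2:]
    # Prefix with (length - 1) zeros so the decoder knows how many bits follow.
    return "0" * (len(binary) - 1) + binary


def encode(ref: str, seq: str) -> str:
    """Encode seq as gamma-coded gaps, each followed by a 2-bit base."""
    out = []
    prev = -1
    for pos, base in diff_against_reference(ref, seq):
        gap = pos - prev  # relative coordinate, always >= 1
        out.append(elias_gamma(gap))
        out.append(format(BASES.index(base), "02b"))
        prev = pos
    return "".join(out)


def decode(ref: str, bits: str) -> str:
    """Rebuild the sequence from the reference and the encoded bit string."""
    seq = list(ref)
    i = 0
    pos = -1
    while i < len(bits):
        # Count leading zeros to learn the length of the gamma-coded gap.
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        gap = int(bits[i:i + zeros + 1], 2)
        i += zeros + 1
        # The next two bits name the substituted base.
        base = BASES[int(bits[i:i + 2], 2)]
        i += 2
        pos += gap
        seq[pos] = base
    return "".join(seq)


if __name__ == "__main__":
    ref = "ACGTACGTACGT"
    seq = "ACGTACCTACGA"
    bits = encode(ref, seq)
    print(bits, len(bits), decode(ref, bits) == seq)
```

In this toy example the two substitutions take 14 bits in total, and a sequence identical to the reference encodes to an empty string. Relative coordinates keep the coded integers small when differences cluster, which is where codes such as Elias gamma spend few bits; a Golomb or Huffman code could be swapped in for `elias_gamma` under the same framing.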

AVAILABILITY

Data are publicly available from GenBank, the HapMap web site, and the MITOMAP database. Supplementary materials with additional results, statistics, and software implementations are available from http://mammag.web.uci.edu/bin/view/Mitowiki/ProjectDNACompression.
