Data structures and compression algorithms for genomic sequence data.

Authors

Brandon Marty C, Wallace Douglas C, Baldi Pierre

Affiliation

Department of Computer Science, UCI, Irvine, CA 92697, USA.

Publication

Bioinformatics. 2009 Jul 15;25(14):1731-8. doi: 10.1093/bioinformatics/btp319. Epub 2009 May 15.

DOI:10.1093/bioinformatics/btp319

PMID:19447783

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2705231/

Abstract

MOTIVATION

The continuing exponential accumulation of full genome data, including full diploid human genomes, creates new challenges not only for understanding genomic structure, function and evolution, but also for the storage, navigation and privacy of genomic data. Here, we develop data structures and algorithms for the efficient storage of genomic and other sequence data that may also facilitate querying and protecting the data.

RESULTS

The general idea is to encode only the differences between a genome sequence and a reference sequence, using absolute or relative coordinates for the location of the differences. These locations and the corresponding differential variants can be encoded into binary strings using various entropy coding methods, from fixed codes, such as Golomb and Elias codes, to variable codes, such as Huffman codes. We demonstrate the approach and various tradeoffs using highly variable human mitochondrial genome sequences as a testbed. With only a partial level of optimization, 3615 genome sequences occupying 56 MB in GenBank are compressed down to only 167 KB, achieving a 345-fold compression rate, using the revised Cambridge Reference Sequence as the reference sequence. Using the consensus sequence as the reference sequence, the data can be stored using only 133 KB, corresponding to a 433-fold level of compression, roughly a 23% improvement. Extensions to nuclear genomes and high-throughput sequencing data are discussed.
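The reference-based idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes pre-aligned, equal-length sequences that differ by substitutions only, stores each difference as a relative coordinate (the gap from the previous difference) in Elias gamma code, and appends a fixed 2-bit code for the substituted base. All function names and the 2-bit base code are hypothetical choices made for illustration.

```python
from typing import List, Tuple

# Assumed fixed 2-bit code for the substituted base (illustrative only).
BASE_CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}
CODE_BASE = {code: base for base, code in BASE_CODE.items()}


def diff_substitutions(reference: str, genome: str) -> List[Tuple[int, str]]:
    """List (position, base) for every site where genome differs from reference."""
    if len(reference) != len(genome):
        raise ValueError("this sketch handles equal-length sequences only")
    return [(i, g) for i, (r, g) in enumerate(zip(reference, genome)) if r != g]


def elias_gamma_encode(n: int) -> str:
    """Elias gamma code of n >= 1: (bit length - 1) zeros, then n in binary."""
    if n < 1:
        raise ValueError("Elias gamma is defined for integers >= 1")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def read_elias_gamma(bits: str, i: int) -> Tuple[int, int]:
    """Decode one gamma-coded integer starting at index i; return (value, next index)."""
    zeros = 0
    while bits[i] == "0":
        zeros += 1
        i += 1
    value = int(bits[i:i + zeros + 1], 2)
    return value, i + zeros + 1


def compress(reference: str, genome: str) -> str:
    """Encode genome as a bit string of (gap, base) pairs relative to reference."""
    out, prev = [], -1
    for pos, base in diff_substitutions(reference, genome):
        out.append(elias_gamma_encode(pos - prev))  # relative coordinate, always >= 1
        out.append(BASE_CODE[base])
        prev = pos
    return "".join(out)


def decompress(reference: str, bits: str) -> str:
    """Rebuild the genome by replaying the encoded substitutions onto reference."""
    seq, i, pos = list(reference), 0, -1
    while i < len(bits):
        gap, i = read_elias_gamma(bits, i)
        pos += gap
        seq[pos] = CODE_BASE[bits[i:i + 2]]
        i += 2
    return "".join(seq)


reference = "ACGTACGTAC"
genome = "ACGAACGTTC"  # differs from the reference at positions 3 and 8
bits = compress(reference, genome)
print(bits, len(bits))
print(decompress(reference, bits) == genome)
```

Relative coordinates keep the encoded integers small when differences cluster, which is what makes short codes for small values, such as Elias gamma, a natural fit. A real encoder would also need to handle insertions and deletions and could replace the fixed base code with a Huffman code built from observed variant frequencies.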

AVAILABILITY

Data are publicly available from GenBank, the HapMap web site, and the MITOMAP database. Supplementary materials with additional results, statistics, and software implementations are available from http://mammag.web.uci.edu/bin/view/Mitowiki/ProjectDNACompression.

Similar articles

Data structures and compression algorithms for genomic sequence data.

Bioinformatics. 2009 Jul 15;25(14):1731-8. doi: 10.1093/bioinformatics/btp319. Epub 2009 May 15.

CoGI: Towards Compressing Genomes as an Image.

IEEE/ACM Trans Comput Biol Bioinform. 2015 Nov-Dec;12(6):1275-85. doi: 10.1109/TCBB.2015.2430331.

Toward a Better Compression for DNA Sequences Using Huffman Encoding.

J Comput Biol. 2017 Apr;24(4):280-288. doi: 10.1089/cmb.2016.0151. Epub 2016 Dec 13.

Efficient storage of high throughput DNA sequencing data using reference-based compression.

Genome Res. 2011 May;21(5):734-40. doi: 10.1101/gr.114819.110. Epub 2011 Jan 18.

ERGC: an efficient referential genome compression algorithm.

Bioinformatics. 2015 Nov 1;31(21):3468-75. doi: 10.1093/bioinformatics/btv399. Epub 2015 Jul 2.

SCALCE: boosting sequence compression algorithms using locally consistent encoding.

Bioinformatics. 2012 Dec 1;28(23):3051-7. doi: 10.1093/bioinformatics/bts593. Epub 2012 Oct 9.

DELIMINATE--a fast and efficient method for loss-less compression of genomic sequences: sequence analysis.

Bioinformatics. 2012 Oct 1;28(19):2527-9. doi: 10.1093/bioinformatics/bts467. Epub 2012 Jul 25.

smallWig: parallel compression of RNA-seq WIG files.

Bioinformatics. 2016 Jan 15;32(2):173-80. doi: 10.1093/bioinformatics/btv561. Epub 2015 Sep 30.

Data structures and compression algorithms for high-throughput sequencing technologies.

BMC Bioinformatics. 2010 Oct 14;11:514. doi: 10.1186/1471-2105-11-514.

iDoComp: a compression scheme for assembled genomes.

Bioinformatics. 2015 Mar 1;31(5):626-33. doi: 10.1093/bioinformatics/btu698. Epub 2014 Oct 24.

Cited by

A Hybrid Data-Differencing and Compression Algorithm for the Automotive Industry.

Entropy (Basel). 2022 Apr 19;24(5):574. doi: 10.3390/e24050574.

CHAPAO: Likelihood and hierarchical reference-based representation of biomolecular sequences and applications to compressing multiple sequence alignments.

PLoS One. 2022 Apr 18;17(4):e0265360. doi: 10.1371/journal.pone.0265360. eCollection 2022.

Efficient DNA sequence compression with neural networks.

Gigascience. 2020 Nov 11;9(11). doi: 10.1093/gigascience/giaa119.

Vertical lossless genomic data compression tools for assembled genomes: A systematic literature review.

PLoS One. 2020 May 26;15(5):e0232942. doi: 10.1371/journal.pone.0232942. eCollection 2020.

Tackling the Challenges of FASTQ Referential Compression.

Bioinform Biol Insights. 2019 Feb 14;13:1177932218821373. doi: 10.1177/1177932218821373. eCollection 2019.

TRCMGene: A two-step referential compression method for the efficient storage of genetic data.

PLoS One. 2018 Nov 5;13(11):e0206521. doi: 10.1371/journal.pone.0206521. eCollection 2018.

Review of applications of high-throughput sequencing in personalized medicine: barriers and facilitators of future progress in research and clinical application.

Brief Bioinform. 2019 Sep 27;20(5):1795-1811. doi: 10.1093/bib/bby051.

Algorithms designed for compressed-gene-data transformation among gene banks with different references.

BMC Bioinformatics. 2018 Jun 18;19(1):230. doi: 10.1186/s12859-018-2230-2.

NRGC: a novel referential genome compression algorithm.

Bioinformatics. 2016 Nov 15;32(22):3405-3412. doi: 10.1093/bioinformatics/btw505. Epub 2016 Aug 2.

Bitpacking techniques for indexing genomes: I. Hash tables.

Algorithms Mol Biol. 2016 Apr 18;11:5. doi: 10.1186/s13015-016-0069-5. eCollection 2016.
