Brandon Marty C, Wallace Douglas C, Baldi Pierre
Department of Computer Science, UCI, Irvine, CA 92697, USA.
Bioinformatics. 2009 Jul 15;25(14):1731-8. doi: 10.1093/bioinformatics/btp319. Epub 2009 May 15.
The continuing exponential accumulation of full genome data, including full diploid human genomes, creates new challenges not only for understanding genomic structure, function and evolution, but also for the storage, navigation and privacy of genomic data. Here, we develop data structures and algorithms for the efficient storage of genomic and other sequence data that may also facilitate querying and protecting the data.
The general idea is to encode only the differences between a genome sequence and a reference sequence, using absolute or relative coordinates for the location of the differences. These locations and the corresponding differential variants can be encoded into binary strings using various entropy coding methods, from fixed codes such as Golomb and Elias codes, to variables codes, such as Huffman codes. We demonstrate the approach and various tradeoffs using highly variables human mitochondrial genome sequences as a testbed. With only a partial level of optimization, 3615 genome sequences occupying 56 MB in GenBank are compressed down to only 167 KB, achieving a 345-fold compression rate, using the revised Cambridge Reference Sequence as the reference sequence. Using the consensus sequence as the reference sequence, the data can be stored using only 133 KB, corresponding to a 433-fold level of compression, roughly a 23% improvement. Extensions to nuclear genomes and high-throughput sequencing data are discussed.
Data are publicly available from GenBank, the HapMap web site, and the MITOMAP database. Supplementary materials with additional results, statistics, and software implementations are available from http://mammag.web.uci.edu/bin/view/Mitowiki/ProjectDNACompression.
包括完整二倍体人类基因组在内的全基因组数据持续呈指数级积累,这不仅给理解基因组结构、功能和进化带来了新挑战,也给基因组数据的存储、导航和隐私保护带来了新挑战。在此,我们开发了数据结构和算法,用于高效存储基因组及其他序列数据,这也可能有助于数据查询和保护。
总体思路是仅对基因组序列与参考序列之间的差异进行编码,使用差异位置的绝对或相对坐标。这些位置和相应的差异变体可以使用各种熵编码方法编码为二进制字符串,从诸如哥伦布码和埃利亚斯码等固定码,到诸如哈夫曼码等可变码。我们以高度可变的人类线粒体基因组序列为测试平台,展示了该方法及各种权衡。仅经过部分优化,以修订后的剑桥参考序列作为参考序列,GenBank中占据56MB的3615个基因组序列被压缩至仅167KB,实现了345倍的压缩率。以共有序列作为参考序列时,数据仅需133KB即可存储,对应433倍的压缩率,压缩率提高了约23%。文中还讨论了对核基因组和高通量测序数据的扩展。
数据可从GenBank、HapMap网站和MITOMAP数据库公开获取。有关更多结果、统计信息和软件实现的补充材料可从http://mammag.web.uci.edu/bin/view/Mitowiki/ProjectDNACompression获取。