Christley Scott, Lu Yiming, Li Chen, Xie Xiaohui
Department of Computer Science, University of California Irvine, Irvine, CA 92697, USA.
Bioinformatics. 2009 Jan 15;25(2):274-5. doi: 10.1093/bioinformatics/btn582. Epub 2008 Nov 7.
The amount of genomic sequence data being generated and made available through public databases continues to grow at an accelerating rate. Downloading, copying, sharing and manipulating these large datasets are becoming difficult and time-consuming for researchers. We need to consider using advanced compression techniques as part of a standard data format for genomic data. The inherent structure of genome data allows for more efficient lossless compression than generic compression programs can achieve. We apply a series of techniques to James Watson's genome that, in combination, reduce it to a mere 4 MB, small enough to be sent as an email attachment.
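The abstract does not name the techniques in the series. As one illustration of how the structure of sequence data can beat a generic compressor, the sketch below assumes a reference sequence that sender and receiver both hold, and stores only the positions where a sample differs from it, with those positions delta-encoded. The function names, the synthetic random data and the substitution-only model are assumptions made for this example and are not taken from the paper.

```python
import random
import zlib


def diff_against_reference(reference: str, sample: str) -> list[tuple[int, str]]:
    """Return (gap, base) pairs for every position where sample differs.

    gap is the distance from the previous differing position (delta encoding),
    so the stored numbers stay small when differences are spread out.
    Assumption: substitutions only, so both strings have the same length.
    """
    if len(reference) != len(sample):
        raise ValueError("this sketch handles substitutions only")
    pairs = []
    previous = 0
    for position, (ref_base, sample_base) in enumerate(zip(reference, sample)):
        if ref_base != sample_base:
            pairs.append((position - previous, sample_base))
            previous = position
    return pairs


def restore_from_reference(reference: str, pairs: list[tuple[int, str]]) -> str:
    """Rebuild the sample exactly from the reference and the (gap, base) pairs."""
    bases = list(reference)
    position = 0
    for gap, base in pairs:
        position += gap
        bases[position] = base
    return "".join(bases)


def serialize(pairs: list[tuple[int, str]]) -> bytes:
    """Write the pairs as plain text, one 'gap + base' entry per line."""
    return "\n".join(f"{gap}{base}" for gap, base in pairs).encode("ascii")


def mutate(reference: str, rate: float, rng: random.Random) -> str:
    """Make a synthetic sample by substituting a fraction of the bases."""
    bases = list(reference)
    for i, base in enumerate(bases):
        if rng.random() < rate:
            bases[i] = rng.choice([b for b in "ACGT" if b != base])
    return "".join(bases)


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in data, not a real genome.
    reference = "".join(rng.choice("ACGT") for _ in range(100_000))
    sample = mutate(reference, 0.001, rng)

    pairs = diff_against_reference(reference, sample)
    assert restore_from_reference(reference, pairs) == sample  # lossless

    generic = len(zlib.compress(sample.encode("ascii"), 9))
    reference_based = len(zlib.compress(serialize(pairs), 9))
    print(f"generic zlib on the full sequence: {generic} bytes")
    print(f"differences from the reference:    {reference_based} bytes")
```

The saving depends on the receiver already holding the same reference, so the cost of the reference itself is not counted in the second figure.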