Matos Luís M O, Neves António J R, Pratas Diogo, Pinho Armando J
Signal Processing Lab, IEETA/DETI, University of Aveiro, 3810-193 Aveiro, Portugal.
PLoS One. 2015 Mar 27;10(3):e0116082. doi: 10.1371/journal.pone.0116082. eCollection 2015.
In the last decade, the cost of genomic sequencing has fallen so much that researchers all over the world are accumulating huge amounts of data for present and future use. These genomic data need to be stored efficiently, because storage costs are not decreasing as fast as sequencing costs. To address this problem, general-purpose compression tools, most commonly gzip, are usually used. However, these tools were not specifically designed to compress this kind of data, and they often fall short when the goal is to reduce the data size as much as possible. Several compression algorithms are available, some of them for genomic data, but very few have been designed to handle Whole Genome Alignments, which contain alignments between entire genomes of several species. In this paper, we present a lossless compression tool, MAFCO, specifically designed to compress MAF (Multiple Alignment Format) files. Compared to gzip, the proposed tool attains a compression gain of 34% to 57%, depending on the data set. Compared to a recent dedicated method, which is not compatible with some data sets, the compression gain of MAFCO is about 9%. Both source code and binaries for several operating systems are freely available for non-commercial use at: http://bioinformatics.ua.pt/software/mafco.
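As a rough illustration of how a gain figure relative to gzip might be computed, the sketch below measures a gzip baseline and expresses a competing compressed size as a percentage reduction against it. It assumes that "compression gain" means the relative size reduction with respect to the gzip output; the abstract does not define the term, so this may differ from the authors' definition. The function names, the `SAMPLE_MAF` text, and the sizes in the usage note are hypothetical, and the code does not call or reproduce MAFCO itself.

```python
import gzip


def gzip_size(data: bytes, level: int = 9) -> int:
    """Return the size in bytes of `data` after gzip compression."""
    return len(gzip.compress(data, compresslevel=level))


def compression_gain(baseline_size: int, new_size: int) -> float:
    """Percentage size reduction of `new_size` relative to `baseline_size`.

    Assumed definition (not taken from the abstract):
        gain = 100 * (baseline - new) / baseline
    """
    if baseline_size <= 0:
        raise ValueError("baseline_size must be positive")
    return 100.0 * (baseline_size - new_size) / baseline_size


# Hypothetical, simplified MAF-like alignment block used only as test input.
SAMPLE_MAF = (
    b"##maf version=1\n"
    b"a score=100.0\n"
    b"s speciesA.chr1 100 20 + 5000 ACGTACGT--ACGTACGTACGT\n"
    b"s speciesB.chr2 200 20 + 6000 ACGTACGTTTACGT--ACGTAC\n"
    b"\n"
)

# gzip baseline for a repetitive input built from the sample block.
baseline = gzip_size(SAMPLE_MAF * 1000)
```

Under this assumed definition, a dedicated tool that produced 66 bytes where gzip produced 100 would have a gain of `compression_gain(100, 66)`, that is 34%, and one that produced 43 bytes would have a gain of 57%.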