Advanced Analytics Institute, University of Technology Sydney, Broadway, NSW 2007, Australia.
School of Computing, National University of Singapore, Singapore 117417.
Bioinformatics. 2017 Nov 1;33(21):3364-3372. doi: 10.1093/bioinformatics/btx412.
The rapidly increasing number of genomes generated by high-throughput sequencing platforms and assembly algorithms is accompanied by problems in data storage, compression and communication. Traditional compression algorithms are unable to meet the demand of high compression ratio due to the intrinsic challenging features of DNA sequences such as small alphabet size, frequent repeats and palindromes. Reference-based lossless compression, by which only the differences between two similar genomes are stored, is a promising approach with high compression ratio.
We present a high-performance referential genome compression algorithm named HiRGC. It is based on a 2-bit encoding scheme and an advanced greedy-matching search on a hash table. We compare the performance of HiRGC with four state-of-the-art compression methods on a benchmark dataset of eight human genomes. HiRGC takes <30 min to compress about 21 gigabytes of each set of the seven target genomes into 96-260 megabytes, achieving compression ratios of 217 to 82 times. This performance is at least 1.9 times better than the best competing algorithm on its best case. Our compression speed is also at least 2.9 times faster. HiRGC is stable and robust to deal with different reference genomes. In contrast, the competing methods' performance varies widely on different reference genomes. More experiments on 100 human genomes from the 1000 Genome Project and on genomes of several other species again demonstrate that HiRGC's performance is consistently excellent.
The C ++ and Java source codes of our algorithm are freely available for academic and non-commercial use. They can be downloaded from https://github.com/yuansliu/HiRGC.
Supplementary data are available at Bioinformatics online.
高通量测序平台和组装算法生成的基因组数量迅速增加,随之而来的数据存储、压缩和通信问题也不断涌现。由于 DNA 序列具有小字母表大小、频繁重复和回文等固有挑战性特征,传统的压缩算法由于无法满足高压缩比的需求。基于参考的无损压缩方法,即只存储两个相似基因组之间的差异,是一种很有前途的方法,具有很高的压缩比。
我们提出了一种名为 HiRGC 的高性能参考基因组压缩算法。它基于 2 位编码方案和哈希表上的高级贪婪匹配搜索。我们在一个包含 8 个人类基因组的基准数据集上,将 HiRGC 的性能与四种最先进的压缩方法进行了比较。HiRGC 只需不到 30 分钟即可将 7 个目标基因组中的每一组约 21 吉字节压缩到 96-260 兆字节,压缩比为 217 到 82 倍。这一性能至少比最佳竞争算法在最佳情况下好 1.9 倍。我们的压缩速度也至少快 2.9 倍。HiRGC 对于不同的参考基因组稳定且鲁棒。相比之下,竞争方法在不同的参考基因组上的性能差异很大。对 1000 基因组计划中的 100 个人类基因组和其他几个物种的基因组的更多实验再次表明,HiRGC 的性能始终优异。
我们算法的 C++和 Java 源代码可免费用于学术和非商业用途。它们可以从 https://github.com/yuansliu/HiRGC 下载。
补充数据可在《生物信息学》在线获取。