Guerra Aníbal, Lotero Jaime, Aedo José Édinson, Isaza Sebastián
Facultad de Ciencias y Tecnología (FaCyT), Universidad de Carabobo (UC), Valencia, Venezuela.
Facultad de Ingeniería, Universidad de Antioquia (UdeA), Medellín, Colombia.
Bioinform Biol Insights. 2019 Feb 14;13:1177932218821373. doi: 10.1177/1177932218821373. eCollection 2019.
The exponential growth of genomic data has recently motivated the development of compression algorithms to tackle the storage capacity limitations in bioinformatics centers. Referential compressors could theoretically achieve much higher compression than their non-referential counterparts; however, the latest tools have not yet been able to harness that potential. Reaching that goal requires an efficient encoding model to represent the differences between the input and the reference. In this article, we introduce a novel approach for referential compression of FASTQ files. The core of our compression scheme is a referential compressor based on the combination of local alignments with binary encoding optimized for long reads. We present the algorithms and performance tests developed for our read compression algorithm, named UdeACompress. Compared with the best state-of-the-art programs, our compressor achieved the best results on long reads and competitive compression ratios on shorter reads. It also showed reasonable execution times and memory consumption in comparison with similar tools.
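To illustrate the general idea of representing a read by its differences from a reference, here is a minimal Python sketch. It is not UdeACompress's encoding model: the function names `encode_read` and `decode_read`, the toy sequences, and the restriction to substitutions at an already known alignment position are all assumptions made for illustration. A real referential compressor would also handle insertions and deletions and would pack the result into a compact binary form.

```python
# Illustrative sketch only: hypothetical helpers, not the UdeACompress format.
# Assumes the alignment position is already known and that the read differs
# from the reference by substitutions only (no insertions or deletions).

def encode_read(reference, read, pos):
    """Represent `read` as (pos, length, substitutions) relative to `reference`."""
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read does not fit in the reference at this position")
    # Store only the offsets where the read disagrees with the reference.
    subs = [(i, b) for i, (a, b) in enumerate(zip(window, read)) if a != b]
    return pos, len(read), subs


def decode_read(reference, record):
    """Rebuild the original read from the reference and its difference record."""
    pos, length, subs = record
    bases = list(reference[pos:pos + length])
    for offset, base in subs:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTTAGCCGATAGGC"  # toy reference sequence
read = "CGTTAGACGA"                 # toy read with one mismatch

record = encode_read(reference, read, 5)
restored = decode_read(reference, record)
```

In this toy case the 10-base read is reduced to a position, a length, and a single substitution, which is why a good match against the reference can in principle be stored far more compactly than the raw bases.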