Facultad de Ingeniería, Universidad de la República, Montevideo, 11300, Uruguay.
Facultad de Ciencias, Universidad de la República, Montevideo, 11400, Uruguay.
Bioinformatics. 2021 Dec 11;37(24):4862-4864. doi: 10.1093/bioinformatics/btab437.
Nanopore sequencing technologies are rapidly gaining popularity, in part because of the massive amounts of genomic data they produce in short periods of time (up to 8.5 TB of data in <72 h). Efficient compression methods for this type of data are needed to reduce the costs of transmission and storage.
We introduce RENANO, a reference-based lossless data compressor specifically tailored to FASTQ files generated with nanopore sequencing technologies. RENANO improves on its predecessor ENANO, currently the state of the art, by providing a more efficient component for compressing base call sequences. Two compression algorithms are introduced, corresponding to the following scenarios: (1) a reference genome is available without cost to both the compressor and the decompressor, and (2) the reference genome is available only on the compressor side, and a compacted version of the reference is included in the compressed file. We compare the compression performance of RENANO against ENANO on several publicly available nanopore datasets. Averaged over all the datasets, RENANO improves on ENANO's base call sequence compression by 39.8% in scenario (1) and by 33.5% in scenario (2). For total file compression, the average improvements are 12.7% and 10.6%, respectively. We also show that RENANO consistently outperforms the recent general-purpose genomic compressor Genozip.
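To make the two scenarios concrete, the following is a minimal toy sketch of reference-based encoding. It is not RENANO's algorithm: it assumes each read is already aligned to a known start position with substitutions only (no insertions or deletions), and all function names (`encode_read`, `decode_read`, `compact_reference`) are hypothetical names introduced for illustration. In scenario (1) the decoder uses the full reference; in scenario (2) the encoder ships only the reference intervals that the reads actually cover, with start positions remapped accordingly.

```python
from typing import List, Tuple

Mismatch = Tuple[int, str]                 # (offset within read, base in read)
Record = Tuple[int, int, List[Mismatch]]   # (reference start, read length, mismatches)


def encode_read(ref: str, read: str, start: int) -> Record:
    """Describe a read as a reference window plus substitutions (toy model).

    Assumes start + len(read) <= len(ref) and a substitution-only alignment.
    """
    window = ref[start:start + len(read)]
    mismatches = [(i, b) for i, (a, b) in enumerate(zip(window, read)) if a != b]
    return (start, len(read), mismatches)


def decode_read(ref: str, record: Record) -> str:
    """Rebuild the read from the reference window and its substitutions."""
    start, length, mismatches = record
    bases = list(ref[start:start + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


def compact_reference(ref: str, records: List[Record]) -> Tuple[str, List[Record]]:
    """Scenario (2) sketch: keep only covered reference intervals, remap starts."""
    merged: List[List[int]] = []
    for lo, hi in sorted((s, s + n) for s, n, _ in records):
        if merged and lo <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], hi)   # extend overlapping interval
        else:
            merged.append([lo, hi])

    compact = ""
    placed = []                                      # (lo, hi, offset in compact)
    for lo, hi in merged:
        placed.append((lo, hi, len(compact)))
        compact += ref[lo:hi]

    remapped: List[Record] = []
    for start, length, mismatches in records:
        for lo, hi, offset in placed:
            if lo <= start and start + length <= hi:
                remapped.append((offset + start - lo, length, mismatches))
                break
    return compact, remapped


if __name__ == "__main__":
    reference = "ACGTACGTTTGACCAGTA"
    aligned_reads = [("ACGAAC", 0), ("GACCTG", 10)]  # (read, assumed start)
    records = [encode_read(reference, r, s) for r, s in aligned_reads]
    compact, remapped = compact_reference(reference, records)
    # Scenario (1): decode against the full reference.
    print([decode_read(reference, rec) for rec in records])
    # Scenario (2): decode against the compacted reference stored with the data.
    print([decode_read(compact, rec) for rec in remapped])
```

The design trade-off mirrors the one in the abstract: scenario (2) pays for storing part of the reference inside the compressed file, which is consistent with its smaller reported average improvement compared with scenario (1).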
RENANO is freely available for download at: https://github.com/guilledufort/RENANO.
Supplementary data are available at Bioinformatics online.