United University, Prayagraj, Uttar Pradesh, 211012, India.
School of Computer Science Engineering and Technology, Bennett University, Greater Noida, Uttar Pradesh, 201310, India.
Funct Integr Genomics. 2023 Nov 11;23(4):333. doi: 10.1007/s10142-023-01259-x.
Hospitals and medical laboratories generate enormous volumes of genome sequence data every day for use in research, surgery, and disease diagnosis. Compression is therefore essential to keep the storage, monitoring, and distribution of these data manageable, and a new compression technique is needed to reduce the time and cost of storage, transmission, and processing. General-purpose compression techniques perform poorly on these data because of their special features: a large number of repeats (tandem and palindromic), a small alphabet, high similarity between sequences, and specific file formats. In this study, we present a method for compressing FastQ files that relies on a reference genome and does not sacrifice data quality. FastQ files are first split into three streams (identifier, sequence, and quality score), each of which is compressed with its own technique. A novel, fast, and lightweight mapping mechanism is also presented to compress the sequence stream effectively. Experiments show that the proposed method, RBFQC, outperforms other state-of-the-art genome compression methods on NGS data in both compression ratio and compression/decompression time. Relative to GZIP, RBFQC may achieve a compression ratio of 80-140% for fixed-length datasets and 80-125% for variable-length datasets. Compared with domain-specific reference-based FastQ compression techniques, RBFQC improves total compression and decompression speed by 10-25%.
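The stream-splitting step described in the abstract can be illustrated with a minimal sketch. This is not the RBFQC implementation: it assumes the common four-line FastQ record layout (`@identifier`, sequence, `+` separator, quality string), and the function name `split_fastq_streams` is hypothetical. The per-stream compressors and the reference-mapping mechanism are not shown, because the abstract does not specify them.

```python
from typing import Iterable, List, Tuple


def split_fastq_streams(
    lines: Iterable[str],
) -> Tuple[List[str], List[str], List[str]]:
    """Split four-line FastQ records into identifier, sequence, and quality streams.

    Assumption: every record is exactly four lines; multi-line sequences
    are not handled in this sketch.
    """
    identifiers: List[str] = []
    sequences: List[str] = []
    qualities: List[str] = []
    record: List[str] = []

    for raw in lines:
        line = raw.rstrip("\r\n")
        # Skip blank lines that fall between records (e.g. a trailing newline).
        if not line and not record:
            continue
        record.append(line)
        if len(record) == 4:
            header, seq, separator, qual = record
            if not header.startswith("@") or not separator.startswith("+"):
                raise ValueError(f"malformed FastQ record: {record!r}")
            if len(seq) != len(qual):
                raise ValueError("sequence and quality lengths differ")
            identifiers.append(header[1:])  # drop the leading '@'
            sequences.append(seq)
            qualities.append(qual)
            record = []

    if record:
        raise ValueError("truncated FastQ record at end of input")
    return identifiers, sequences, qualities


example = """@read1
ACGT
+
IIII
@read2
GGCCA
+
FFFFF
"""

ids, seqs, quals = split_fastq_streams(example.splitlines())
```

Once separated, each stream can be handed to a compressor suited to its statistics, which is the motivation the abstract gives for the split.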