College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China.
BGI Research, Wuhan 430074, China.
Gigascience. 2024 Jan 2;13. doi: 10.1093/gigascience/giae046.
With the rise of large-scale genome sequencing projects, genotyping of thousands of samples has produced immense variant call format (VCF) files. These voluminous files are increasingly challenging to store, transfer, and analyze. Compression methods have been used to tackle these issues, aiming for both a high compression ratio and fast random access. However, existing methods have not yet achieved a satisfactory compromise between these two objectives.
To address this issue, we introduce GSC (Genotype Sparse Compression), a specialized lossless compression tool for VCF files. In benchmark tests across various open-source datasets, GSC performed exceptionally well in genotype data compression. Compared with the most advanced existing tools, GBC and GTC, GSC achieved compression ratios that were 26.9% to 82.4% higher across the datasets. In lossless compression scenarios, GSC also performed robustly, with compression ratios 1.5× to 6.5× greater than those of general-purpose tools such as gzip, zstd, and BCFtools; neither GBC nor GTC supports this lossless mode. Achieving such high compression ratios required some reasonable trade-offs, including longer decompression times: GSC was 1.2× to 2× slower than GBC, yet 1.1× to 1.4× faster than GTC. Moreover, GSC maintained decompression query speeds equivalent to those of its competitors. In terms of RAM usage, GSC outperformed both counterparts. Overall, GSC's combined performance surpasses that of the current state-of-the-art tools.
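The results above are reported both as percentage gains ("26.9% to 82.4% higher") and as multiples ("1.5× to 6.5× greater"). The abstract does not define the compression ratio, so the sketch below assumes the common definition, original size divided by compressed size; the paper itself may define it differently. The function names and file sizes are hypothetical and chosen only to show how the two kinds of figures relate.

```python
def compression_ratio(original_bytes: int, compressed_bytes: int) -> float:
    """Assumed definition: original size divided by compressed size."""
    if compressed_bytes <= 0:
        raise ValueError("compressed size must be positive")
    return original_bytes / compressed_bytes


def relative_gain_percent(ratio_new: float, ratio_baseline: float) -> float:
    """How much higher ratio_new is than ratio_baseline, in percent."""
    return (ratio_new / ratio_baseline - 1.0) * 100.0


# Hypothetical sizes for illustration only; these are not values from the paper.
original = 1_000_000
baseline_ratio = compression_ratio(original, 10_000)   # baseline tool output
candidate_ratio = compression_ratio(original, 8_000)   # smaller output, higher ratio

gain = relative_gain_percent(candidate_ratio, baseline_ratio)
multiple = candidate_ratio / baseline_ratio
print(f"baseline {baseline_ratio:.1f}, candidate {candidate_ratio:.1f}, "
      f"gain {gain:.1f}%, multiple {multiple:.2f}x")
```

Under this assumed definition, a ratio that is 25% higher corresponds to a multiple of 1.25×, which is also how the multiples quoted against the general-purpose tools can be read.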
GSC balances high compression ratios with rapid data access, enhancing genomic data management. It supports seamless PLINK binary format conversion, simplifying downstream analysis.