GSC: efficient lossless compression of VCF files with fast query

Affiliations

College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China.

BGI Research, Wuhan 430074, China.

Publication information

GigaScience. 2024 Jan 2;13. doi: 10.1093/gigascience/giae046.

PMID: 39028587
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11258903/

Abstract

BACKGROUND

With the rise of large-scale genome sequencing projects, genotyping of thousands of samples has produced immense variant call format (VCF) files. It is becoming increasingly challenging to store, transfer, and analyze these voluminous files. Compression methods have been used to tackle these issues, aiming for both high compression ratio and fast random access. However, existing methods have not yet achieved a satisfactory compromise between these 2 objectives.
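
The tension between compression ratio and random access can be illustrated with a generic block-wise scheme. The following is a minimal sketch, not GSC's method: records are compressed in fixed-size blocks with Python's standard `zlib`, so that a single record can be fetched by decompressing only one block. The function names, the toy records, and the block size are all assumptions made for illustration.

```python
import zlib

def compress_blocks(lines, block_size=4):
    """Compress records in independent fixed-size blocks (hypothetical layout)."""
    blocks = []
    for start in range(0, len(lines), block_size):
        chunk = "\n".join(lines[start:start + block_size]).encode("utf-8")
        blocks.append(zlib.compress(chunk, 9))
    return blocks

def fetch_record(blocks, record_no, block_size=4):
    """Random access: decompress only the block that holds record_no."""
    block_no, offset = divmod(record_no, block_size)
    chunk = zlib.decompress(blocks[block_no]).decode("utf-8")
    return chunk.split("\n")[offset]

# Toy VCF-like records (made up for this example).
records = [
    f"chr1\t{100 + i}\t.\tA\tG\t.\tPASS\t.\tGT\t0|0\t0|1" for i in range(10)
]
blocks = compress_blocks(records, block_size=4)
print(len(blocks))                          # number of independent blocks
print(fetch_record(blocks, 6, block_size=4))
```

Smaller blocks make each lookup cheaper but give the compressor less context, which tends to lower the compression ratio; larger blocks do the opposite.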

FINDINGS

To address this issue, we introduce GSC (Genotype Sparse Compression), a specialized lossless compression tool for VCF files. In benchmark tests across various open-source datasets, GSC performed exceptionally well in genotype data compression: its compression ratios were 26.9% to 82.4% higher than those of the most advanced existing tools, GBC and GTC. In lossless compression scenarios, a mode that neither GBC nor GTC supports, GSC also performed robustly, with compression ratios 1.5× to 6.5× greater than those of general-purpose tools such as gzip, zstd, and BCFtools. These high compression ratios required some trade-offs, including longer decompression times: GSC was 1.2× to 2× slower than GBC, yet 1.1× to 1.4× faster than GTC. GSC maintained decompression query speeds equivalent to those of its competitors and outperformed both in RAM usage. Overall, GSC's comprehensive performance surpasses that of the most advanced existing tools.
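
For readers unfamiliar with how such figures are typically derived, here is a minimal sketch of the arithmetic, assuming the usual definition of compression ratio as original size divided by compressed size. It uses only Python's standard `gzip` module on made-up VCF-like text; it does not run GSC, GBC, or GTC, and the helper names are hypothetical.

```python
import gzip

def compression_ratio(original: bytes, compressed: bytes) -> float:
    """Original size divided by compressed size (higher is better)."""
    return len(original) / len(compressed)

def percent_higher(ratio_a: float, ratio_b: float) -> float:
    """How much higher ratio_a is than ratio_b, in percent."""
    return (ratio_a / ratio_b - 1.0) * 100.0

def fold_greater(ratio_a: float, ratio_b: float) -> float:
    """ratio_a expressed as a multiple of ratio_b (e.g. 1.5 means 1.5x)."""
    return ratio_a / ratio_b

# Toy VCF-like text (made up); real VCF files are far larger.
toy_vcf = "\n".join(
    f"chr1\t{1000 + i}\t.\tA\tG\t.\tPASS\t.\tGT\t" + "\t".join(["0|0"] * 50)
    for i in range(200)
).encode("utf-8")

fast = gzip.compress(toy_vcf, compresslevel=1)
best = gzip.compress(toy_vcf, compresslevel=9)

ratio_fast = compression_ratio(toy_vcf, fast)
ratio_best = compression_ratio(toy_vcf, best)
print(f"level 1 ratio: {ratio_fast:.1f}")
print(f"level 9 ratio: {ratio_best:.1f}")
print(f"level 9 vs level 1: {percent_higher(ratio_best, ratio_fast):.1f}% higher")
```

Under this definition, a tool with a ratio of 3.0 is 50% higher, or 1.5× greater, than a tool with a ratio of 2.0.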

CONCLUSION

GSC balances high compression ratios with rapid data access, enhancing genomic data management. It supports seamless PLINK binary format conversion, simplifying downstream analysis.
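
To make the PLINK binary target format concrete, the sketch below packs biallelic diploid genotypes into the two-bit codes of a PLINK 1 `.bed` file (four samples per byte, low-order bits first). This is an illustration of the file format only, not GSC's conversion code. The helper names are hypothetical, and the mapping of the VCF ALT allele to PLINK's first allele (A1) is an assumption of this sketch.

```python
# PLINK 1 .bed two-bit genotype codes:
#   00 = homozygous for the first allele (A1), 01 = missing,
#   10 = heterozygous, 11 = homozygous for the second allele (A2).
# Assumption for this sketch: A1 is the VCF ALT allele, A2 is REF.
CODES = {0: 0b11, 1: 0b10, 2: 0b00, None: 0b01}

# Magic number (0x6c 0x1b) followed by 0x01 for variant-major order.
BED_MAGIC = bytes([0x6C, 0x1B, 0x01])

def alt_count(gt: str):
    """Count ALT alleles in a biallelic diploid GT such as '0|1'.

    Returns None for missing or unsupported (e.g. multiallelic) calls.
    """
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or any(a not in ("0", "1") for a in alleles):
        return None
    return sum(int(a) for a in alleles)

def pack_variant(gts):
    """Pack one variant's genotypes into .bed bytes, 4 samples per byte."""
    out = bytearray((len(gts) + 3) // 4)  # unused high bits stay zero
    for i, gt in enumerate(gts):
        out[i // 4] |= CODES[alt_count(gt)] << (2 * (i % 4))
    return bytes(out)

genotypes = ["0/0", "0|1", "1/1", "./.", "0/1"]
bed_bytes = BED_MAGIC + pack_variant(genotypes)
print(bed_bytes.hex())
```

A complete conversion would also need the accompanying `.bim` (variant) and `.fam` (sample) files, which are omitted here.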

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2c5a/11258903/54b13bebbbbb/giae046fig1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2c5a/11258903/1a19004bffdf/giae046fig2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2c5a/11258903/4892211da19a/giae046fig3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2c5a/11258903/3d86ac9e6a80/giae046fig4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2c5a/11258903/a176910eeb17/giae046fig5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2c5a/11258903/650fd6238eb7/giae046fig6.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2c5a/11258903/7db97075bbe7/giae046fig7.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2c5a/11258903/c5ca07bcf7cc/giae046fig8.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2c5a/11258903/c17eb97c754c/giae046fig9.jpg