GBC: a parallel toolkit based on highly addressable byte-encoding blocks for extremely large-scale genotypes of species.

Affiliations

Program in Bioinformatics, Zhongshan School of Medicine and The Fifth Affiliated Hospital, Sun Yat-Sen University, Guangzhou, 510080, China.

Center for Precision Medicine, Sun Yat-Sen University, Guangzhou, China.

Publication information

Genome Biol. 2023 Apr 17;24(1):76. doi: 10.1186/s13059-023-02906-z.

DOI: 10.1186/s13059-023-02906-z

PMID: 37069653

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10108510/

Abstract

Whole-genome sequencing projects of millions of subjects contain enormous numbers of genotypes, entailing a huge memory burden and long computation times. Here, we present GBC, a toolkit for rapidly compressing large-scale genotypes into highly addressable byte-encoding blocks under an optimized parallel framework. We demonstrate that GBC is up to 1000 times faster than state-of-the-art methods at accessing and managing compressed large-scale genotypes while maintaining a competitive compression ratio. We also show that conventional analyses would be substantially sped up if built on GBC to access the genotypes of a large population. GBC's data structure and algorithms are valuable for accelerating large-scale genomic research.
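To illustrate what "addressable byte-encoding blocks" can mean in general, the following is a minimal Python sketch and not GBC's actual format or API. It assumes phased biallelic diploid genotypes, a hypothetical one-byte code table (CODES), and blocks compressed independently with zlib so that a single genotype can be fetched by decompressing only one block. The function names encode_blocks and get_genotype are invented for this illustration.

```python
import zlib

# Hypothetical one-byte codes for phased biallelic diploid genotypes
# (an assumption for this sketch, not GBC's real encoding table).
CODES = {".|.": 0, "0|0": 1, "0|1": 2, "1|0": 3, "1|1": 4}
DECODE = {code: gt for gt, code in CODES.items()}


def encode_blocks(variants, block_size):
    """Pack variants (lists of genotype strings) into independently compressed blocks.

    Each genotype becomes one byte; each block holds up to block_size variants.
    """
    blocks = []
    for start in range(0, len(variants), block_size):
        raw = bytearray()
        for row in variants[start:start + block_size]:
            raw.extend(CODES[gt] for gt in row)
        blocks.append(zlib.compress(bytes(raw)))
    return blocks


def get_genotype(blocks, block_size, n_samples, variant_idx, sample_idx):
    """Fetch one genotype by decompressing only the block that contains it."""
    block_no, row = divmod(variant_idx, block_size)
    raw = zlib.decompress(blocks[block_no])
    # Fixed one-byte width makes the offset a simple multiplication.
    return DECODE[raw[row * n_samples + sample_idx]]


variants = [
    ["0|0", "0|1", "1|1"],
    ["1|0", ".|.", "0|0"],
    ["1|1", "1|1", "0|1"],
]
blocks = encode_blocks(variants, block_size=2)
genotype = get_genotype(blocks, block_size=2, n_samples=3, variant_idx=2, sample_idx=2)
```

Because every genotype occupies exactly one byte and every block is compressed on its own, locating a genotype needs only integer arithmetic plus one block decompression. This is the general idea behind addressable block layouts; the design in the paper may differ in every detail.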
