Program in Bioinformatics, Zhongshan School of Medicine and The Fifth Affiliated Hospital, Sun Yat-Sen University, Guangzhou, 510080, China.
Center for Precision Medicine, Sun Yat-Sen University, Guangzhou, China.
Genome Biol. 2023 Apr 17;24(1):76. doi: 10.1186/s13059-023-02906-z.
Whole-genome sequencing projects of millions of subjects produce enormous numbers of genotypes, imposing a heavy memory burden and long computation times. Here, we present GBC, a toolkit for rapidly compressing large-scale genotypes into highly addressable byte-encoding blocks under an optimized parallel framework. We demonstrate that GBC is up to 1000 times faster than state-of-the-art methods at accessing and managing compressed large-scale genotypes while maintaining a competitive compression ratio. We also show that conventional analyses are substantially accelerated when built on GBC to access the genotypes of a large population. GBC's data structure and algorithms are valuable for accelerating large-scale genomic research.
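To illustrate in general terms why byte-encoded, fixed-width genotype storage can be addressed directly, the following is a minimal sketch and not GBC's actual format or API. It assumes unphased biallelic genotypes coded as alternate-allele counts 0, 1, or 2, with a hypothetical code 3 for missing, packed four per byte; the names `pack_genotypes`, `get_genotype`, and `lookup` are invented for this illustration.

```python
from typing import List

MISSING = 3  # hypothetical 2-bit code for a missing genotype (assumption)


def pack_genotypes(codes: List[int]) -> bytes:
    """Pack 2-bit genotype codes (0, 1, 2, or MISSING), four per byte."""
    out = bytearray((len(codes) + 3) // 4)
    for i, code in enumerate(codes):
        if not 0 <= code <= 3:
            raise ValueError(f"genotype code out of range: {code}")
        # Sample i occupies bits 2*(i % 4) .. 2*(i % 4) + 1 of byte i // 4.
        out[i // 4] |= code << (2 * (i % 4))
    return bytes(out)


def get_genotype(packed: bytes, index: int) -> int:
    """Read one genotype code without unpacking the rest of the row."""
    return (packed[index // 4] >> (2 * (index % 4))) & 0b11


def bytes_per_variant(n_samples: int) -> int:
    """Fixed row width in bytes for one variant."""
    return (n_samples + 3) // 4


def lookup(store: bytes, n_samples: int, variant: int, sample: int) -> int:
    """Constant-time access to (variant, sample) in a concatenated store."""
    row_start = variant * bytes_per_variant(n_samples)
    byte = store[row_start + sample // 4]
    return (byte >> (2 * (sample % 4))) & 0b11


# Two variants, five samples each, stored back to back.
rows = [[0, 1, 2, MISSING, 1], [2, 2, 0, 0, 1]]
store = b"".join(pack_genotypes(r) for r in rows)
```

Because every variant occupies the same number of bytes in this sketch, the offset of any genotype follows from arithmetic alone, so a single value can be read without decoding its neighbours. A real toolkit would additionally compress the blocks and keep an index over them.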