GTRAC: fast retrieval from compressed collections of genomic variants.

Author information

Tatwawadi Kedar, Hernaez Mikel, Ochoa Idoia, Weissman Tsachy

Affiliation

Department of Electrical Engineering, Stanford University, 350 Serra Mall, Stanford, CA, USA.

Publication information

Bioinformatics. 2016 Sep 1;32(17):i479-i486. doi: 10.1093/bioinformatics/btw437.

DOI: 10.1093/bioinformatics/btw437

PMID: 27587665

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5013914/

Abstract

MOTIVATION

The dramatic decrease in the cost of sequencing has resulted in the generation of huge amounts of genomic data, as evidenced by projects such as the UK10K and the Million Veteran Project, with the number of sequenced genomes on the order of 10 K to 1 M. Because of the large redundancy among genomic sequences of individuals from the same species, most medical research deals with the variants in the sequences relative to a reference sequence, rather than with the complete genomic sequences. Consequently, millions of genomes represented as variants are stored in databases. These databases are constantly updated and queried to extract information such as the common variants among individuals or groups of individuals. Previous algorithms for compressing this type of database lack efficient random access capabilities, which makes querying the database for particular variants and/or individuals extremely inefficient, to the point where compression is often abandoned altogether.

RESULTS

We present a new algorithm for this task, called GTRAC, that achieves significant compression ratios while allowing fast random access over the compressed database. For example, GTRAC is able to compress a Homo sapiens dataset containing 1092 samples in 1.1 GB (compression ratio of 160), while allowing for decompression of specific samples in less than a second and decompression of specific variants in 17 ms. GTRAC uses and adapts techniques from information theory, such as a specialized Lempel-Ziv compressor, and tailored succinct data structures.
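
The abstract names succinct data structures without detailing them. As a rough illustration only (not GTRAC's actual implementation), the sketch below shows one common building block: a bit vector with sampled prefix counts that answers rank queries, and how rank can locate a single variant without scanning everything stored before it. The class `RankBitVector`, the helper `variant_at`, the block size, and the toy data are all assumptions made for this example.

```python
class RankBitVector:
    """Bit vector with sampled prefix counts for fast rank queries (illustrative)."""

    def __init__(self, bits, block_size=64):
        self.bits = list(bits)
        self.block_size = block_size
        # prefix[k] = number of 1 bits in bits[0 : k * block_size]
        self.prefix = [0]
        for start in range(0, len(self.bits), block_size):
            block_ones = sum(self.bits[start:start + block_size])
            self.prefix.append(self.prefix[-1] + block_ones)

    def access(self, i):
        """Return the bit stored at position i."""
        return self.bits[i]

    def rank1(self, i):
        """Return the number of 1 bits in bits[0:i]."""
        block, offset = divmod(i, self.block_size)
        start = block * self.block_size
        # One table lookup plus a scan of at most block_size - 1 bits.
        return self.prefix[block] + sum(self.bits[start:start + offset])


def variant_at(has_variant, alt_alleles, site):
    """Return the alternate allele at `site`, or None if the sample matches the reference.

    `has_variant` marks the sites where the sample differs from the reference;
    `alt_alleles` stores one entry per marked site, in site order.
    """
    if not has_variant.access(site):
        return None
    return alt_alleles[has_variant.rank1(site)]


# Toy data (hypothetical): 8 variant sites, 4 of which differ from the reference.
has_variant = RankBitVector([0, 1, 0, 0, 1, 1, 0, 1], block_size=4)
alt_alleles = ["A", "T", "G", "C"]

print(variant_at(has_variant, alt_alleles, 4))  # T
print(variant_at(has_variant, alt_alleles, 0))  # None
```

In a sketch like this, the cost of fetching one variant depends on the block size rather than on the total number of sites, which is the general property that makes random access over a compact representation feasible.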

AVAILABILITY AND IMPLEMENTATION

The GTRAC algorithm is available for download at: https://github.com/kedartatwawadi/GTRAC

CONTACT

kedart@stanford.edu

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Similar articles

GTC: how to maintain huge genotype collections in a compressed form.

Bioinformatics. 2018 Jun 1;34(11):1834-1840. doi: 10.1093/bioinformatics/bty023.

Genome compression: a novel approach for large collections.

Bioinformatics. 2013 Oct 15;29(20):2572-8. doi: 10.1093/bioinformatics/btt460. Epub 2013 Aug 21.

ERGC: an efficient referential genome compression algorithm.

Bioinformatics. 2015 Nov 1;31(21):3468-75. doi: 10.1093/bioinformatics/btv399. Epub 2015 Jul 2.

smallWig: parallel compression of RNA-seq WIG files.

Bioinformatics. 2016 Jan 15;32(2):173-80. doi: 10.1093/bioinformatics/btv561. Epub 2015 Sep 30.

Robust relative compression of genomes with random access.

Bioinformatics. 2011 Nov 1;27(21):2979-86. doi: 10.1093/bioinformatics/btr505. Epub 2011 Sep 5.

CSAM: Compressed SAM format.

Bioinformatics. 2016 Dec 15;32(24):3709-3716. doi: 10.1093/bioinformatics/btw543. Epub 2016 Aug 18.

Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis.

Bioinformatics. 2018 Feb 15;34(4):558-567. doi: 10.1093/bioinformatics/btx639.

SPRING: a next-generation compressor for FASTQ data.

Bioinformatics. 2019 Aug 1;35(15):2674-2676. doi: 10.1093/bioinformatics/bty1015.

High efficiency referential genome compression algorithm.

Bioinformatics. 2019 Jun 1;35(12):2058-2065. doi: 10.1093/bioinformatics/bty934.

Cited by

Analysis-ready VCF at Biobank scale using Zarr.

Gigascience. 2025 Jan 6;14. doi: 10.1093/gigascience/giaf049.

GSC: efficient lossless compression of VCF files with fast query.

Gigascience. 2024 Jan 2;13. doi: 10.1093/gigascience/giae046.

Analysis-ready VCF at Biobank scale using Zarr.

bioRxiv. 2025 Feb 6:2024.06.11.598241. doi: 10.1101/2024.06.11.598241.

GBC: a parallel toolkit based on highly addressable byte-encoding blocks for extremely large-scale genotypes of species.

Genome Biol. 2023 Apr 17;24(1):76. doi: 10.1186/s13059-023-02906-z.

GVC: efficient random access compression for gene sequence variations.

BMC Bioinformatics. 2023 Mar 28;24(1):121. doi: 10.1186/s12859-023-05240-0.

XSI-a genotype compression tool for compressive genomics in large biobanks.

Bioinformatics. 2022 Aug 2;38(15):3778-3784. doi: 10.1093/bioinformatics/btac413.

Ultrafast Comparison of Personal Genomes via Precomputed Genome Fingerprints.

Front Genet. 2017 Sep 26;8:136. doi: 10.3389/fgene.2017.00136. eCollection 2017.
