BEETL-fastq: a searchable compressed archive for DNA reads.

Author information

Janin Lilian, Schulz-Trieglaff Ole, Cox Anthony J

Affiliation

Computational Biology Group, Illumina Cambridge Ltd., Little Chesterford, Essex CB10 1XL, UK.

Publication information

Bioinformatics. 2014 Oct;30(19):2796-801. doi: 10.1093/bioinformatics/btu387. Epub 2014 Jun 20.

DOI:10.1093/bioinformatics/btu387

PMID:24950811

Abstract

MOTIVATION

FASTQ is a standard file format for DNA sequencing data, which stores both nucleotides and quality scores. A typical sequencing study can easily generate hundreds of gigabytes of FASTQ files, while public archives such as ENA and NCBI and large international collaborations such as The Cancer Genome Atlas can accumulate many terabytes of data in this format. Compression tools such as gzip are often used to reduce the storage burden but have the disadvantage that the data must be decompressed before they can be used. Here, we present BEETL-fastq, a tool that not only compresses FASTQ-formatted DNA reads more compactly than gzip but also permits rapid search for k-mer queries within the archived sequences. Importantly, the full FASTQ record of each matching read or read pair is returned, allowing the search results to be piped directly to any of the many standard tools that accept FASTQ data as input.
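To make the search contract concrete, the following is a minimal sketch of what "search for a k-mer and return the full FASTQ record of each matching read" means. It is a naive linear scan over uncompressed, in-memory records, not BEETL-fastq's compressed index, and it ignores read pairs. All function names and the demo reads are hypothetical, and matching on both strands is an assumption of this sketch rather than something the abstract states.

```python
from typing import Iterable, Iterator, List, Tuple

# One FASTQ record: header line, sequence, separator line, quality string.
FastqRecord = Tuple[str, str, str, str]

# Translation table for complementing bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def parse_fastq(text: str) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text, assuming exactly four lines per record."""
    lines = [line for line in text.splitlines() if line]
    for i in range(0, len(lines) - 3, 4):
        yield lines[i], lines[i + 1], lines[i + 2], lines[i + 3]


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def search_kmer(records: Iterable[FastqRecord], kmer: str,
                both_strands: bool = True) -> List[FastqRecord]:
    """Return every record whose sequence contains the k-mer.

    Searching the reverse complement as well is an assumption of this sketch.
    """
    targets = {kmer}
    if both_strands:
        targets.add(reverse_complement(kmer))
    return [rec for rec in records if any(t in rec[1] for t in targets)]


def format_fastq(records: Iterable[FastqRecord]) -> str:
    """Serialize records back to FASTQ text so they can be piped onward."""
    return "".join("\n".join(rec) + "\n" for rec in records)


# Hypothetical demo data: read3 is the reverse complement of read1.
DEMO = (
    "@read1\nACGTACGTGA\n+\nIIIIIIIIII\n"
    "@read2\nTTTTCCCCGG\n+\nFFFFFFFFFF\n"
    "@read3\nTCACGTACGT\n+\nHHHHHHHHHH\n"
)

hits = search_kmer(parse_fastq(DEMO), "CGTGA")
# read1 matches directly; read3 matches through the reverse complement.
print(format_fastq(hits), end="")
```

Because the output is ordinary FASTQ text with sequences and quality strings intact, it can be handed to any downstream tool that reads FASTQ, which is the property the abstract emphasizes. A linear scan like this would be impractical at terabyte scale; the point of an indexed archive is to avoid it.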

RESULTS

We show that 6.6 terabytes of human reads in FASTQ format can be transformed into 1.7 terabytes of indexed files, which can then be searched for 1, 10, 100, 1000 and one million 30-mers in 3, 8, 14, 45 and 567 s, respectively, plus 20 ms per output read. Useful applications of the search capability are highlighted, including the genotyping of structural variant breakpoints and 'in silico pull-down' experiments in which only the reads that cover a region of interest are selectively extracted for the purposes of variant calling or visualization.
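The reported figures can be turned into a small back-of-the-envelope calculator. The sketch below uses only the numbers quoted in the results (sizes, batch search times and the 20 ms per output read); the function names and the example of 5000 output reads are hypothetical, chosen purely for illustration.

```python
# Search times in seconds for batches of 30-mers, as quoted in the results.
SEARCH_SECONDS = {1: 3, 10: 8, 100: 14, 1000: 45, 1_000_000: 567}

# Additional cost per returned read, as quoted in the results (20 ms).
SECONDS_PER_OUTPUT_READ = 0.020


def compression_ratio(raw_tb: float = 6.6, indexed_tb: float = 1.7) -> float:
    """Ratio of raw FASTQ size to indexed archive size."""
    return raw_tb / indexed_tb


def amortized_seconds_per_query(n_queries: int) -> float:
    """Search time divided by batch size, for one of the reported batch sizes."""
    return SEARCH_SECONDS[n_queries] / n_queries


def estimated_total_seconds(n_queries: int, n_output_reads: int) -> float:
    """Search time for a reported batch size plus the per-read output cost."""
    return SEARCH_SECONDS[n_queries] + SECONDS_PER_OUTPUT_READ * n_output_reads


print(f"compression ratio: {compression_ratio():.2f}x")
for n in SEARCH_SECONDS:
    print(f"{n:>9} queries: {amortized_seconds_per_query(n) * 1000:.3f} ms per query")

# Hypothetical scenario: 1000 queries that together return 5000 reads.
print(f"estimated total: {estimated_total_seconds(1000, 5000):.0f} s")
```

Two things stand out from the arithmetic: the indexed archive is a little under four times smaller than the raw FASTQ, and the per-query cost falls from 3 s for a single 30-mer to well under a millisecond when a million are batched. For queries that return many reads, the 20 ms per output read can dominate the total.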

AVAILABILITY AND IMPLEMENTATION

BEETL-fastq is part of the BEETL library, available as a GitHub repository at github.com/BEETL/BEETL.

Similar articles

BEETL-fastq: a searchable compressed archive for DNA reads.

Bioinformatics. 2014 Oct;30(19):2796-801. doi: 10.1093/bioinformatics/btu387. Epub 2014 Jun 20.

SCALCE: boosting sequence compression algorithms using locally consistent encoding.

Bioinformatics. 2012 Dec 1;28(23):3051-7. doi: 10.1093/bioinformatics/bts593. Epub 2012 Oct 9.

Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform.

Bioinformatics. 2012 Jun 1;28(11):1415-9. doi: 10.1093/bioinformatics/bts173. Epub 2012 May 3.

Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis.

Bioinformatics. 2018 Feb 15;34(4):558-567. doi: 10.1093/bioinformatics/btx639.

CIndex: compressed indexes for fast retrieval of FASTQ files.

Bioinformatics. 2022 Jan 3;38(2):335-343. doi: 10.1093/bioinformatics/btab655.

Indexing k-mers in linear space for quality value compression.

J Bioinform Comput Biol. 2019 Oct;17(5):1940011. doi: 10.1142/S0219720019400110.

Adaptive reference-free compression of sequence quality scores.

Bioinformatics. 2014 Jan 1;30(1):24-30. doi: 10.1093/bioinformatics/btt257. Epub 2013 May 9.

FQC: A novel approach for efficient compression, archival, and dissemination of fastq datasets.

J Bioinform Comput Biol. 2015 Jun;13(3):1541003. doi: 10.1142/S0219720015410036. Epub 2015 Feb 8.

SPRING: a next-generation compressor for FASTQ data.

Bioinformatics. 2019 Aug 1;35(15):2674-2676. doi: 10.1093/bioinformatics/bty1015.

UNDR ROVER - a fast and accurate variant caller for targeted DNA sequencing.

BMC Bioinformatics. 2016 Apr 16;17:165. doi: 10.1186/s12859-016-1014-9.

Cited by

A Pipeline for Constructing Reference Genomes for Large Cohort-Specific Metagenome Compression.

Microorganisms. 2023 Oct 14;11(10):2560. doi: 10.3390/microorganisms11102560.

Scalable sequence database search using partitioned aggregated Bloom comb trees.

Bioinformatics. 2023 Jun 30;39(39 Suppl 1):i252-i259. doi: 10.1093/bioinformatics/btad225.

Navigating bottlenecks and trade-offs in genomic data analysis.

Nat Rev Genet. 2023 Apr;24(4):235-250. doi: 10.1038/s41576-022-00551-z. Epub 2022 Dec 7.

Data structures based on k-mers for querying large collections of sequencing data sets.

Genome Res. 2021 Jan;31(1):1-12. doi: 10.1101/gr.260604.119. Epub 2020 Dec 16.

REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets.

Bioinformatics. 2020 Jul 1;36(Suppl_1):i177-i185. doi: 10.1093/bioinformatics/btaa487.

Tackling the Challenges of FASTQ Referential Compression.

Bioinform Biol Insights. 2019 Feb 14;13:1177932218821373. doi: 10.1177/1177932218821373. eCollection 2019.

ARSDA: A New Approach for Storing, Transmitting and Analyzing Transcriptomic Data.

G3 (Bethesda). 2017 Dec 4;7(12):3839-3848. doi: 10.1534/g3.117.300271.

LW-FQZip 2: a parallelized reference-based compression of FASTQ files.

BMC Bioinformatics. 2017 Mar 20;18(1):179. doi: 10.1186/s12859-017-1588-x.

Using reference-free compressed data structures to analyze sequencing reads from thousands of human genomes.

Genome Res. 2017 Feb;27(2):300-309. doi: 10.1101/gr.211748.116. Epub 2016 Dec 16.

Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph.

BMC Bioinformatics. 2015 Sep 14;16:288. doi: 10.1186/s12859-015-0709-7.
