BEETL-fastq: a searchable compressed archive for DNA reads.

Author information

Janin Lilian, Schulz-Trieglaff Ole, Cox Anthony J

Affiliation

Computational Biology Group, Illumina Cambridge Ltd., Little Chesterford, Essex CB10 1XL, UK.

Publication information

Bioinformatics. 2014 Oct;30(19):2796-801. doi: 10.1093/bioinformatics/btu387. Epub 2014 Jun 20.

Abstract

MOTIVATION

FASTQ is a standard file format for DNA sequencing data, which stores both nucleotides and quality scores. A typical sequencing study can easily generate hundreds of gigabytes of FASTQ files, while public archives such as ENA and NCBI and large international collaborations such as the Cancer Genome Atlas can accumulate many terabytes of data in this format. Compression tools such as gzip are often used to reduce the storage burden but have the disadvantage that the data must be decompressed before they can be used. Here, we present BEETL-fastq, a tool that not only compresses FASTQ-formatted DNA reads more compactly than gzip but also permits rapid search for k-mer queries within the archived sequences. Importantly, the full FASTQ record of each matching read or read pair is returned, allowing the search results to be piped directly to any of the many standard tools that accept FASTQ data as input.
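To make the search semantics concrete, here is a minimal Python sketch of what "return the full FASTQ record of each read containing a k-mer" means. It is not BEETL-fastq's method: it scans uncompressed records linearly, checks only the forward strand, and assumes four-line FASTQ records. The names `FastqRecord`, `parse_fastq` and `search_kmer` and the sample reads are hypothetical, introduced only for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    """One read: identifier, nucleotides and per-base quality string."""
    name: str
    sequence: str
    quality: str

    def to_fastq(self) -> str:
        # Re-emit the record in four-line FASTQ form so it can be piped onward.
        return f"@{self.name}\n{self.sequence}\n+\n{self.quality}\n"


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Parse well-formed four-line FASTQ records (no wrapped sequences)."""
    stripped = (line.rstrip("\n") for line in lines)
    for header in stripped:
        if not header:
            continue  # skip blank lines between records
        sequence = next(stripped)
        next(stripped)  # '+' separator line, ignored
        quality = next(stripped)
        yield FastqRecord(header[1:], sequence, quality)


def search_kmer(records: Iterable[FastqRecord], kmer: str) -> Iterator[FastqRecord]:
    """Yield every record whose sequence contains the k-mer (naive linear scan)."""
    for record in records:
        if kmer in record.sequence:
            yield record


# Hypothetical sample data, not taken from the paper.
SAMPLE = """@read1
ACGTACGTGA
+
IIIIIIIIII
@read2
TTTTGGGGCC
+
FFFFFFFFFF
"""

hits = list(search_kmer(parse_fastq(SAMPLE.splitlines()), "GTGA"))
for hit in hits:
    print(hit.to_fastq(), end="")
```

Because each hit is a complete record rather than just a match position, the output can be fed straight into any tool that reads FASTQ, which is the property the abstract emphasizes.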

RESULTS

We show that 6.6 terabytes of human reads in FASTQ format can be transformed into 1.7 terabytes of indexed files, from which we can search for 1, 10, 100, 1000 and 1 million 30-mers in 3, 8, 14, 45 and 567 s, respectively, plus 20 ms per output read. Useful applications of the search capability are highlighted, including the genotyping of structural variant breakpoints and 'in silico pull-down' experiments in which only the reads that cover a region of interest are selectively extracted for the purposes of variant calling or visualization.
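The reported figures can be turned into a few derived quantities. The sketch below uses only the numbers quoted above; the additive cost model (search time plus 20 ms per returned read) follows the wording of the abstract, while the function names and the example count of 5000 output reads are assumptions made for illustration.

```python
# Figures quoted in the abstract.
RAW_TB = 6.6        # FASTQ input size, terabytes
INDEXED_TB = 1.7    # size of the indexed archive, terabytes
SEARCH_SECONDS = {1: 3, 10: 8, 100: 14, 1000: 45, 1_000_000: 567}
PER_READ_SECONDS = 0.020  # extra cost per output read

# Size reduction achieved by the indexed archive (roughly 3.9-fold).
compression_ratio = RAW_TB / INDEXED_TB


def seconds_per_query(n_queries: int) -> float:
    """Amortized search time per 30-mer for one of the reported batch sizes."""
    return SEARCH_SECONDS[n_queries] / n_queries


def total_seconds(n_queries: int, n_output_reads: int) -> float:
    """Search time plus the per-read output cost (assumed to be additive)."""
    return SEARCH_SECONDS[n_queries] + PER_READ_SECONDS * n_output_reads


print(f"compression ratio: {compression_ratio:.2f}x")
for n in SEARCH_SECONDS:
    print(f"{n:>9} queries: {seconds_per_query(n):.6f} s per query")

# Hypothetical example: 1000 queries that together return 5000 reads.
print(f"estimated total: {total_seconds(1000, 5000):.0f} s")
```

The per-query cost drops sharply as the batch grows, from 3 s for a single 30-mer to well under a millisecond each for a million, so batching queries is the efficient way to use the index.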

AVAILABILITY AND IMPLEMENTATION

BEETL-fastq is part of the BEETL library, available as a GitHub repository at github.com/BEETL/BEETL.
