BEETL-fastq: a searchable compressed archive for DNA reads.

Authors

Janin Lilian, Schulz-Trieglaff Ole, Cox Anthony J

Affiliations

Computational Biology Group, Illumina Cambridge Ltd., Little Chesterford, Essex CB10 1XL, UK.

Publication information

Bioinformatics. 2014 Oct;30(19):2796-801. doi: 10.1093/bioinformatics/btu387. Epub 2014 Jun 20.

DOI:10.1093/bioinformatics/btu387
PMID:24950811
Abstract

MOTIVATION

FASTQ is a standard file format for DNA sequencing data, which stores both nucleotides and quality scores. A typical sequencing study can easily generate hundreds of gigabytes of FASTQ files, while public archives such as ENA and NCBI and large international collaborations such as the Cancer Genome Atlas can accumulate many terabytes of data in this format. Compression tools such as gzip are often used to reduce the storage burden but have the disadvantage that the data must be decompressed before they can be used. Here, we present BEETL-fastq, a tool that not only compresses FASTQ-formatted DNA reads more compactly than gzip but also permits rapid search for k-mer queries within the archived sequences. Importantly, the full FASTQ record of each matching read or read pair is returned, allowing the search results to be piped directly to any of the many standard tools that accept FASTQ data as input.
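To make the search contract concrete, the following is a minimal Python sketch of what "return the full FASTQ record of each matching read" means. It is a naive linear scan over uncompressed text, not the indexed search that BEETL-fastq performs on its compressed archive, and every name in it (`parse_fastq`, `search_kmer`, `format_fastq`, the example reads) is hypothetical and chosen for illustration only.

```python
from typing import Iterable, Iterator, Tuple

# A FASTQ record is four lines: header, bases, separator, quality string.
FastqRecord = Tuple[str, ...]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield 4-line records; assumes well-formed, unwrapped FASTQ input."""
    cleaned = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(cleaned) - 3, 4):
        yield tuple(cleaned[i:i + 4])


def search_kmer(records: Iterable[FastqRecord], kmer: str) -> Iterator[FastqRecord]:
    """Naive scan: yield every record whose base sequence contains the k-mer.

    Only the forward strand is checked; this is an illustration, not the
    algorithm used by BEETL-fastq.
    """
    for record in records:
        if kmer in record[1]:
            yield record


def format_fastq(record: FastqRecord) -> str:
    """Serialize a record back to FASTQ text so it can be piped onward."""
    return "\n".join(record) + "\n"


# Made-up example data (two reads, not taken from the paper).
EXAMPLE = "@read1\nACGTACGTAC\n+\nIIIIIIIIII\n@read2\nTTTTGGGGCC\n+\nIIIIIIIIII\n"

hits = list(search_kmer(parse_fastq(EXAMPLE.splitlines()), "GTAC"))
for hit in hits:
    print(format_fastq(hit), end="")
```

Because each hit is emitted as a complete four-line record, the output of such a search is itself valid FASTQ and can be passed unchanged to downstream tools, which is the property the abstract emphasizes.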

RESULTS

We show that 6.6 terabytes of human reads in FASTQ format can be transformed into 1.7 terabytes of indexed files, from which we can search for 1, 10, 100, 1000 and one million 30-mers in 3, 8, 14, 45 and 567 s, respectively, plus 20 ms per output read. Useful applications of the search capability are highlighted, including the genotyping of structural variant breakpoints and 'in silico pull-down' experiments in which only the reads that cover a region of interest are selectively extracted for the purposes of variant calling or visualization.
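The reported figures can be turned into a few derived quantities. The sketch below uses only the numbers stated in the Results paragraph; the function names are hypothetical, and the cost model (batch search time plus 20 ms for each returned read) is one reading of "plus 20 ms per output read", not a formula given in the abstract.

```python
# Figures taken from the Results paragraph.
RAW_TB = 6.6
INDEXED_TB = 1.7
SEARCH_SECONDS = {1: 3, 10: 8, 100: 14, 1000: 45, 1_000_000: 567}
PER_READ_SECONDS = 0.020  # 20 ms per output read


def compression_ratio(raw_tb: float, indexed_tb: float) -> float:
    """Size of the raw FASTQ data divided by the size of the indexed files."""
    return raw_tb / indexed_tb


def amortized_seconds_per_query(n_queries: int) -> float:
    """Reported batch search time divided by the number of 30-mers in the batch."""
    return SEARCH_SECONDS[n_queries] / n_queries


def estimated_total_seconds(n_queries: int, n_output_reads: int) -> float:
    """Assumed model: batch search time plus a fixed cost per returned read."""
    return SEARCH_SECONDS[n_queries] + PER_READ_SECONDS * n_output_reads


print(f"compression ratio: {compression_ratio(RAW_TB, INDEXED_TB):.2f}x")
for n in SEARCH_SECONDS:
    print(f"{n:>9} queries: {amortized_seconds_per_query(n):.6f} s per query")
```

Under this reading, the indexed archive is roughly a quarter of the raw size, and the per-query cost falls sharply as batches grow, from 3 s for a single 30-mer to well under a millisecond each for a batch of one million.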

AVAILABILITY AND IMPLEMENTATION

BEETL-fastq is part of the BEETL library, available as a GitHub repository at github.com/BEETL/BEETL.

Similar articles

1. BEETL-fastq: a searchable compressed archive for DNA reads.
   Bioinformatics. 2014 Oct;30(19):2796-801. doi: 10.1093/bioinformatics/btu387. Epub 2014 Jun 20.
2. SCALCE: boosting sequence compression algorithms using locally consistent encoding.
   Bioinformatics. 2012 Dec 1;28(23):3051-7. doi: 10.1093/bioinformatics/bts593. Epub 2012 Oct 9.
3. Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform.
   Bioinformatics. 2012 Jun 1;28(11):1415-9. doi: 10.1093/bioinformatics/bts173. Epub 2012 May 3.
4. Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis.
   Bioinformatics. 2018 Feb 15;34(4):558-567. doi: 10.1093/bioinformatics/btx639.
5. CIndex: compressed indexes for fast retrieval of FASTQ files.
   Bioinformatics. 2022 Jan 3;38(2):335-343. doi: 10.1093/bioinformatics/btab655.
6. Indexing k-mers in linear space for quality value compression.
   J Bioinform Comput Biol. 2019 Oct;17(5):1940011. doi: 10.1142/S0219720019400110.
7. Adaptive reference-free compression of sequence quality scores.
   Bioinformatics. 2014 Jan 1;30(1):24-30. doi: 10.1093/bioinformatics/btt257. Epub 2013 May 9.
8. FQC: A novel approach for efficient compression, archival, and dissemination of fastq datasets.
   J Bioinform Comput Biol. 2015 Jun;13(3):1541003. doi: 10.1142/S0219720015410036. Epub 2015 Feb 8.
9. SPRING: a next-generation compressor for FASTQ data.
   Bioinformatics. 2019 Aug 1;35(15):2674-2676. doi: 10.1093/bioinformatics/bty1015.
10. UNDR ROVER - a fast and accurate variant caller for targeted DNA sequencing.
    BMC Bioinformatics. 2016 Apr 16;17:165. doi: 10.1186/s12859-016-1014-9.

Cited by

1. A Pipeline for Constructing Reference Genomes for Large Cohort-Specific Metagenome Compression.
   Microorganisms. 2023 Oct 14;11(10):2560. doi: 10.3390/microorganisms11102560.
2. Scalable sequence database search using partitioned aggregated Bloom comb trees.
   Bioinformatics. 2023 Jun 30;39(39 Suppl 1):i252-i259. doi: 10.1093/bioinformatics/btad225.
3. Navigating bottlenecks and trade-offs in genomic data analysis.
   Nat Rev Genet. 2023 Apr;24(4):235-250. doi: 10.1038/s41576-022-00551-z. Epub 2022 Dec 7.
4. Data structures based on k-mers for querying large collections of sequencing data sets.
   Genome Res. 2021 Jan;31(1):1-12. doi: 10.1101/gr.260604.119. Epub 2020 Dec 16.
5. REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets.
   Bioinformatics. 2020 Jul 1;36(Suppl_1):i177-i185. doi: 10.1093/bioinformatics/btaa487.
6. Tackling the Challenges of FASTQ Referential Compression.
   Bioinform Biol Insights. 2019 Feb 14;13:1177932218821373. doi: 10.1177/1177932218821373. eCollection 2019.
7. ARSDA: A New Approach for Storing, Transmitting and Analyzing Transcriptomic Data.
   G3 (Bethesda). 2017 Dec 4;7(12):3839-3848. doi: 10.1534/g3.117.300271.
8. LW-FQZip 2: a parallelized reference-based compression of FASTQ files.
   BMC Bioinformatics. 2017 Mar 20;18(1):179. doi: 10.1186/s12859-017-1588-x.
9. Using reference-free compressed data structures to analyze sequencing reads from thousands of human genomes.
   Genome Res. 2017 Feb;27(2):300-309. doi: 10.1101/gr.211748.116. Epub 2016 Dec 16.
10. Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph.
    BMC Bioinformatics. 2015 Sep 14;16:288. doi: 10.1186/s12859-015-0709-7.