Improved Search of Large Transcriptomic Sequencing Databases Using Split Sequence Bloom Trees

Authors

Solomon Brad, Kingsford Carl

Affiliation

Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania.

Publication information

J Comput Biol. 2018 Jul;25(7):755-765. doi: 10.1089/cmb.2017.0265. Epub 2018 Mar 12.

DOI:10.1089/cmb.2017.0265

PMID:29641248

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6067102/

Abstract

Enormous databases of short-read RNA-seq experiments such as the NIH Sequencing Read Archive are now available. These databases could answer many questions about condition-specific expression or population variation, and this resource is only going to grow over time. However, these collections remain difficult to use due to the inability to search for a particular expressed sequence. Although some progress has been made on this problem, it is still not feasible to search collections of hundreds of terabytes of short-read sequencing experiments. We introduce an indexing scheme called split sequence bloom trees (SSBTs) to support sequence-based querying of terabyte scale collections of thousands of short-read sequencing experiments. SSBT is an improvement over the sequence bloom tree (SBT) data structure for the same task. We apply SSBTs to the problem of finding conditions under which query transcripts are expressed. Our experiments are conducted on a set of 2652 publicly available RNA-seq experiments for the breast, blood, and brain tissues. We demonstrate that this SSBT index can be queried for a 1000 nt sequence in <4 minutes using a single thread and can be stored in just 39 GB, a fivefold improvement in search and storage costs compared with SBT.
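The abstract does not spell out how a Bloom-filter tree answers a sequence query, so the following is a minimal sketch of the basic sequence bloom tree idea that SSBT builds on, not the authors' implementation and not the split-filter variant itself. It assumes each experiment is summarized by a Bloom filter over its k-mers, each internal node holds the union of its children's filters, and a query descends only into subtrees whose filter contains at least a fraction `theta` of the query k-mers. The names `leaf`, `merge`, `query`, and the parameters `K`, `M`, `H`, and `theta` are illustrative assumptions with toy values.

```python
import hashlib

K = 5      # k-mer length (assumed toy value; real indexes use a larger k)
M = 4096   # Bloom filter size in bits (assumed)
H = 1      # number of hash functions (assumed)


def kmers(seq, k=K):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def positions(kmer, m=M, h=H):
    """Map a k-mer to h bit positions using seeded SHA-256 digests."""
    out = []
    for seed in range(h):
        digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
        out.append(int.from_bytes(digest[:8], "big") % m)
    return out


class Node:
    """Tree node; `bits` is the Bloom filter stored as a Python int."""

    def __init__(self, name, bits, children=()):
        self.name = name
        self.bits = bits
        self.children = children


def leaf(name, reads):
    """Build a leaf whose filter holds every k-mer of one experiment."""
    bits = 0
    for read in reads:
        for km in kmers(read):
            for p in positions(km):
                bits |= 1 << p
    return Node(name, bits)


def merge(left, right):
    """Build an internal node whose filter is the union of its children."""
    return Node(None, left.bits | right.bits, (left, right))


def contains(node, kmer):
    """Bloom membership test: all of the k-mer's bits must be set."""
    return all((node.bits >> p) & 1 for p in positions(kmer))


def query(node, query_kmers, theta=0.8):
    """Return names of leaves holding at least theta of the query k-mers."""
    hits = sum(contains(node, km) for km in query_kmers)
    if hits < theta * len(query_kmers):
        return []  # prune: no descendant can reach the threshold
    if not node.children:
        return [node.name]
    found = []
    for child in node.children:
        found.extend(query(child, query_kmers, theta))
    return found
```

Because a union filter contains every bit of its descendants, pruning at an internal node never discards a true match; the only errors are Bloom filter false positives. A call such as `query(root, kmers("ACGTACGTTGCA"))` returns the names of the experiments whose filters pass the threshold.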
