Improved Search of Large Transcriptomic Sequencing Databases Using Split Sequence Bloom Trees.

Authors

Solomon Brad, Kingsford Carl

Affiliation

Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania.

Publication information

J Comput Biol. 2018 Jul;25(7):755-765. doi: 10.1089/cmb.2017.0265. Epub 2018 Mar 12.

Abstract

Enormous databases of short-read RNA-seq experiments, such as the NIH Sequence Read Archive, are now available. These databases could answer many questions about condition-specific expression or population variation, and this resource is only going to grow over time. However, these collections remain difficult to use because they cannot be searched for a particular expressed sequence. Although some progress has been made on this problem, it is still not feasible to search collections of hundreds of terabytes of short-read sequencing experiments. We introduce an indexing scheme called split sequence Bloom trees (SSBTs) to support sequence-based querying of terabyte-scale collections of thousands of short-read sequencing experiments. SSBT is an improvement over the sequence Bloom tree (SBT) data structure for the same task. We apply SSBTs to the problem of finding conditions under which query transcripts are expressed. Our experiments are conducted on a set of 2652 publicly available RNA-seq experiments for the breast, blood, and brain tissues. We demonstrate that this SSBT index can be queried for a 1000 nt sequence in <4 minutes using a single thread and can be stored in just 39 GB, a fivefold improvement in search and storage costs compared with SBT.
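The abstract does not describe how a sequence Bloom tree answers a query, so the following is only a minimal sketch of the general idea behind the baseline SBT, not the authors' implementation. It assumes that each leaf holds a Bloom filter of one experiment's k-mers, that each internal node holds the union of its children's filters, and that a subtree is searched only when at least a fraction `theta` of the query's k-mers hit the node's filter. The node splitting that distinguishes SSBT is not modeled here, and every name (`BloomFilter`, `Node`, `leaf`, `merge`, `query`, `theta`) and parameter value is hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


def kmers(seq: str, k: int) -> List[str]:
    """Return every overlapping substring of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class BloomFilter:
    """Toy Bloom filter; a Python int serves as the bit vector."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 2):
        self.num_bits = num_bits      # illustrative size, not from the paper
        self.num_hashes = num_hashes  # illustrative count, not from the paper
        self.bits = 0

    def _positions(self, item: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def union(self, other: "BloomFilter") -> "BloomFilter":
        out = BloomFilter(self.num_bits, self.num_hashes)
        out.bits = self.bits | other.bits
        return out


@dataclass
class Node:
    bloom: BloomFilter
    name: Optional[str] = None                     # set only on leaves
    children: List["Node"] = field(default_factory=list)


def leaf(name: str, reads: List[str], k: int) -> Node:
    """Build a leaf whose filter holds all k-mers of one experiment."""
    bloom = BloomFilter()
    for read in reads:
        for kmer in kmers(read, k):
            bloom.add(kmer)
    return Node(bloom, name)


def merge(left: Node, right: Node) -> Node:
    """Build an internal node whose filter is the union of its children."""
    return Node(left.bloom.union(right.bloom), None, [left, right])


def query(node: Node, query_kmers: List[str], theta: float) -> List[str]:
    """Return names of leaves containing at least theta of the query k-mers."""
    if not query_kmers:
        return []
    hits = sum(kmer in node.bloom for kmer in query_kmers)
    if hits < theta * len(query_kmers):
        return []  # prune: no experiment below this node can reach theta
    if not node.children:
        return [node.name]
    found: List[str] = []
    for child in node.children:
        found.extend(query(child, query_kmers, theta))
    return found
```

A query would be issued as `query(root, kmers(sequence, k), theta)`, where `root` is built by repeatedly calling `merge` on leaves. Pruning is safe in this sketch because a Bloom filter never reports a false negative, so a union filter that misses a k-mer guarantees every descendant misses it too; false positives can only cause extra subtrees to be visited.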



