Solomon Brad, Kingsford Carl
Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania.
J Comput Biol. 2018 Jul;25(7):755-765. doi: 10.1089/cmb.2017.0265. Epub 2018 Mar 12.
Enormous databases of short-read RNA-seq experiments such as the NIH Sequence Read Archive are now available. These databases could answer many questions about condition-specific expression or population variation, and this resource is only going to grow over time. However, these collections remain difficult to use because they cannot be searched for a particular expressed sequence. Although some progress has been made on this problem, it is still not feasible to search collections of hundreds of terabytes of short-read sequencing experiments. We introduce an indexing scheme called split sequence bloom trees (SSBTs) to support sequence-based querying of terabyte-scale collections of thousands of short-read sequencing experiments. SSBT is an improvement over the sequence bloom tree (SBT) data structure for the same task. We apply SSBTs to the problem of finding conditions under which query transcripts are expressed. Our experiments are conducted on a set of 2652 publicly available RNA-seq experiments for the breast, blood, and brain tissues. We demonstrate that this SSBT index can be queried for a 1000 nt sequence in under 4 minutes using a single thread and can be stored in just 39 GB, a fivefold improvement in search and storage costs compared with SBT.
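To make the querying task concrete, the following is a minimal sketch of the general idea behind a sequence bloom tree, written as an illustration and not taken from the authors' implementation. It assumes each experiment is summarized by a Bloom filter of its k-mers, each internal node stores the union of its children's filters, and a query descends only into subtrees whose filter contains at least a fraction theta of the query k-mers. It does not implement the split (SSBT) refinement. All names (`BloomFilter`, `Node`, `leaf`, `merge`, `query`), the filter size, the k-mer length, and the threshold are hypothetical choices for this example.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter backed by a Python int used as a bit vector."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def kmers(seq, k):
    """Return all overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class Node:
    def __init__(self, name, bloom, children=()):
        self.name = name
        self.bloom = bloom
        self.children = children


def leaf(name, reads, k):
    """Build a leaf whose filter holds every k-mer seen in one experiment."""
    bloom = BloomFilter()
    for read in reads:
        for kmer in kmers(read, k):
            bloom.add(kmer)
    return Node(name, bloom)


def merge(name, left, right):
    """Build an internal node whose filter is the bitwise OR of its children."""
    bloom = BloomFilter(left.bloom.num_bits, left.bloom.num_hashes)
    bloom.bits = left.bloom.bits | right.bloom.bits
    return Node(name, bloom, (left, right))


def query(node, query_kmers, theta):
    """Return names of leaves containing at least theta of the query k-mers."""
    hits = sum(1 for kmer in query_kmers if kmer in node.bloom)
    if hits < theta * len(query_kmers):
        return []  # prune: no leaf below this node can reach the threshold
    if not node.children:
        return [node.name]
    found = []
    for child in node.children:
        found.extend(query(child, query_kmers, theta))
    return found


# Two made-up "experiments" and a query drawn from the first one.
K = 8
exp_a = leaf("exp_a", ["ACGTACGTTAGCCGATAGGCTA"], K)
exp_b = leaf("exp_b", ["TTTTGGGGCCCCAAAATTTTGG"], K)
root = merge("root", exp_a, exp_b)
hits = query(root, kmers("ACGTACGTTAGCCGATAG", K), theta=0.9)
```

Pruning is valid in this sketch because a parent's filter contains every bit set in its children, so a subtree that fails the threshold at its root cannot contain a leaf that passes it. Bloom filters can report false positives, so a real index would size the filters to the data; the values here are only for demonstration.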