Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland.
Biomedical Informatics Research, University Hospital Zurich, 8091 Zurich, Switzerland.
Genome Res. 2022 Sep 27;32(9):1754-1764. doi: 10.1101/gr.276607.122.
Sequencing data are rapidly accumulating in public repositories. Making this resource accessible for interactive analysis at scale requires efficient approaches for its storage and indexing. There have recently been remarkable advances in building compressed representations of annotated (or colored) de Bruijn graphs for efficiently indexing k-mer sets. However, approaches for representing quantitative attributes such as gene expression or genome positions in a general manner have remained underexplored. In this work, we propose counting de Bruijn graphs, a notion generalizing annotated de Bruijn graphs by supplementing each node-label relation with one or many attributes (e.g., a k-mer count or its positions). Counting de Bruijn graphs index k-mer abundances from 2652 human RNA-seq samples in over eightfold smaller representations compared with state-of-the-art bioinformatics tools and are faster to construct and query. Furthermore, counting de Bruijn graphs with positional annotations losslessly represent entire reads in indexes on average 27% smaller than the input compressed with gzip for human Illumina RNA-seq and 57% smaller for Pacific Biosciences (PacBio) HiFi sequencing of viral samples. A complete searchable index of all viral PacBio SMRT reads from NCBI's Sequence Read Archive (SRA) (152,884 samples, 875 Gbp) comprises only 178 GB. Finally, on the full RefSeq collection, we generate a lossless and fully queryable index that is 4.6-fold smaller than the MegaBLAST index. The techniques proposed in this work naturally complement existing methods and tools using de Bruijn graphs, and significantly broaden their applicability: from indexing k-mer counts and genome positions to implementing novel sequence alignment algorithms on top of highly compressed graph-based sequence indexes.
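To make the central notion concrete, the following toy sketch attaches attributes to each node-label relation: every k-mer (node) maps to the labels it occurs in, and each such relation carries a list of positions, whose length is the k-mer count. This is an illustration only. It uses plain Python dictionaries rather than the compressed representations the work describes, and the names `build_index`, `kmer_count`, and `reconstruct` are hypothetical, not part of any published tool.

```python
from collections import defaultdict


def build_index(samples, k):
    """Toy index: k-mer -> {label: [positions]} (uncompressed, illustration only)."""
    index = defaultdict(lambda: defaultdict(list))
    for label, sequence in samples.items():
        for pos in range(len(sequence) - k + 1):
            index[sequence[pos:pos + k]][label].append(pos)
    return index


def kmer_count(index, kmer, label):
    """Count attribute of a node-label relation (0 if the relation is absent)."""
    return len(index.get(kmer, {}).get(label, []))


def reconstruct(index, label):
    """Rebuild the labeled sequence from positional attributes alone."""
    placed = sorted(
        (pos, kmer)
        for kmer, labels in index.items()
        for pos in labels.get(label, [])
    )
    if not placed:
        return ""
    # The first k-mer contributes all its characters; each later one adds one.
    return placed[0][1] + "".join(kmer[-1] for _, kmer in placed[1:])


samples = {"s1": "ACGTACGT", "s2": "CGTTT"}  # hypothetical input sequences
index = build_index(samples, k=3)
print(kmer_count(index, "ACG", "s1"))
print(reconstruct(index, "s1"))
```

The reconstruction step shows why positional annotations make the index lossless for this toy case: the stored positions are enough to recover each labeled sequence exactly, provided the sequence is at least k characters long.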