Ahmed Omar, Boucher Christina, Langmead Ben
bioRxiv. 2024 May 30:2024.05.25.595899. doi: 10.1101/2024.05.25.595899.
Taxonomic sequence classification is a computational problem central to the study of metagenomics and evolution. Advances in compressed indexing with the r-index enable full-text pattern matching against large sequence collections. But the data structures that link pattern sequences to their clades of origin still do not scale well to large collections. Previous work proposed document array profiles, which use O(rd) words of space, where r is the number of maximal equal-letter runs in the Burrows-Wheeler transform and d is the number of distinct genomes. The linear dependence on d is limiting, since real taxonomies can easily contain tens of thousands of leaves or more. We propose a method called cliff compression that reduces this size by a large factor, over 250x when indexing the SILVA 16S rRNA gene database. This method uses Θ(r log d) words of space in expectation under a random model we propose here. We implemented these ideas in an open-source tool called Cliffy that performs efficient taxonomic classification of sequencing reads with respect to a compressed taxonomic index. When applied to simulated 16S rRNA reads, Cliffy's read-level accuracy is higher than Kraken2's by 11-18%. Cliffy also predicts clade abundances more accurately than Kraken2 and Bracken. Overall, Cliffy is a fast and space-economical extension to compressed full-text indexes, enabling them to perform fast and accurate taxonomic classification queries.
2012 ACM SUBJECT CLASSIFICATION: Applied computing → Computational genomics.
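To make the two space bounds in the abstract concrete, the following is a minimal sketch, not Cliffy's implementation. It computes the Burrows-Wheeler transform of a toy string naively, counts its maximal equal-letter runs (r), and compares the growth of r·d against r·log d. The function names, the base-2 logarithm, and the omission of constant factors are all assumptions made for illustration; the Θ(r log d) bound in the abstract holds in expectation under the authors' random model, which this sketch does not reproduce.

```python
import math


def bwt(text: str) -> str:
    """Naive Burrows-Wheeler transform via sorted rotations.

    Assumes `text` ends with a sentinel (here '$') that sorts before
    every other character. Quadratic space; for toy inputs only.
    """
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s: str) -> int:
    """Count maximal equal-letter runs in s (the quantity r)."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


def profile_words(r: int, d: int) -> int:
    """Growth term of the O(rd) bound (constant factor assumed to be 1)."""
    return r * d


def cliff_words_estimate(r: int, d: int) -> float:
    """Growth term of the Theta(r log d) bound (base 2 and constant 1 assumed)."""
    return r * math.log2(d)


if __name__ == "__main__":
    transformed = bwt("banana$")
    r = count_runs(transformed)
    print(transformed, r)

    # Hypothetical document counts, chosen only to show how the terms diverge.
    for d in (10, 1_000, 10_000):
        print(d, profile_words(r, d), round(cliff_words_estimate(r, d), 1))
```

The ratio between the two terms is d / log d, so it widens as the number of genomes grows. The measured reduction reported in the abstract depends on the real data and on constants that this sketch ignores.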