Ahmed Omar Y, Boucher Christina, Langmead Ben
Johns Hopkins University.
University of Florida.
Genome Res. 2025 Aug 25. doi: 10.1101/gr.279846.124.
Taxonomic sequence classification is a computational problem central to the study of metagenomics and evolution. Advances in compressed indexing with the r-index enable full-text pattern matching against large sequence collections. But the data structures that link pattern sequences to their clades of origin still do not scale well to large collections. Previous work proposed the document array profiles, which use O(rd) words of space, where r is the number of maximal equal-letter runs in the Burrows-Wheeler transform and d is the number of distinct genomes. The linear dependence on d is limiting, since real taxonomies can easily contain tens of thousands of leaves or more. We propose a method called cliff compression that reduces this size by a large factor, over 250× when indexing the SILVA 16S rRNA gene database. This method uses Θ(r log d) words of space in expectation under a random model we propose here. We implemented these ideas in an open-source tool called Cliffy that performs efficient taxonomic classification of sequencing reads with respect to a compressed taxonomic index. When applied to simulated 16S rRNA reads, Cliffy's read-level accuracy is higher than Kraken2's by 11-18%. Clade abundances are also more accurately predicted by Cliffy than by Kraken2 and Bracken. Overall, Cliffy is a fast and space-economical extension to compressed full-text indexes, enabling them to perform fast and accurate taxonomic classification queries. Cliffy's accuracy underscores the advantages of full-text indexes, which offer a more precise solution compared to k-mer indexes designed for a specific value of k.
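To make the two space bounds in the abstract concrete, here is a minimal Python sketch. It is not Cliffy's implementation. It writes r for the number of maximal equal-letter runs in the Burrows-Wheeler transform and d for the number of distinct genomes; the function names, the `$` sentinel, and the toy input are all assumptions made for illustration. The bounds are asymptotic (and the compressed one holds in expectation under the authors' random model), so the numbers show only how the two terms grow with d and do not reproduce the reported 250× reduction.

```python
import math


def bwt(text: str) -> str:
    """Naive Burrows-Wheeler transform via sorted rotations (toy inputs only)."""
    # '$' is an assumed sentinel: absent from the text and smaller than any letter.
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Count maximal equal-letter runs in s (this is r when s is a BWT)."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


def profile_words(r: int, d: int) -> int:
    """Growth term for uncompressed document array profiles: r * d."""
    return r * d


def cliff_words(r: int, d: int) -> int:
    """Growth term for the expected compressed size, constants ignored: r * log2(d)."""
    return r * max(1, math.ceil(math.log2(d)))


# Illustrative usage on a toy string; real indexes are built by dedicated tools.
r = count_runs(bwt("banana"))
for d in (10, 1_000, 10_000):
    print(d, profile_words(r, d), cliff_words(r, d))
```

The loop shows the r·d term growing linearly in the number of genomes while the r·log d term grows only logarithmically, which is the scaling difference the abstract describes.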