Tan Steven, Majidian Sina, Langmead Ben, Zakeri Mohsen
Department of Computer Science, Johns Hopkins University, USA.
bioRxiv. 2025 May 27:2025.05.22.655637. doi: 10.1101/2025.05.22.655637.
The number of reference genomes is rapidly increasing, thanks to advances in long-read sequencing and assembly. While these collections can improve the sensitivity and specificity of classification methods, this requires highly efficient compressed indexes. K-mer-based approaches like Kraken 2 are efficient but limit the analysis to a fixed k-mer length. This length is hard for the user to set ahead of time, and suboptimal settings can harm sensitivity and specificity. Methods that use compressed full-text indexes like SPUMONI2 and Cliffy lift this constraint, but are less efficient than k-mer-based tools. Further, these methods either cannot report a full listing of genomes where a match occurs, or cannot scale to large reference databases. We propose new methods and algorithms that use compressed full-text indexes to enable multi-class and taxonomic classification. Unlike past compressed-indexing methods for classification, ours uses the move structure, which is extremely fast thanks to its locality of reference. Our method, called Movi Color, augments the main table of the Movi index. Specifically, Movi Color assigns a "color" to each run of the Burrows-Wheeler Transform according to the subset of genomes from which the run's suffixes originated. When the reference is highly repetitive, as is typical when indexing pangenomes or reference databases, only certain colors occur, creating opportunities to compress the index. For species-level classification, Movi Color achieves over 1.6× higher precision and 2× higher recall than Kraken 2 and Metabuli. At the genus level, it achieves 70% higher precision and 80% higher recall. Movi Color's read processing time is 7-20× faster than Metabuli and is comparable to Kraken 2. Although Movi Color uses more memory than both Kraken 2 and Metabuli, its speed-accuracy trade-off makes it well-suited for real-time or high-throughput scenarios.
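The run-coloring idea described in the abstract can be illustrated with a small toy sketch. This is not the Movi Color implementation: it builds a naive suffix array instead of a move structure, uses `$` as an assumed separator after each genome, and the function name `color_bwt_runs` and its return layout are hypothetical, chosen only for illustration. It shows how each run of the Burrows-Wheeler Transform could be labeled with the set of genomes its suffixes come from, and how identical sets share a single color ID.

```python
from typing import FrozenSet, List, Tuple


def color_bwt_runs(
    genomes: List[str],
) -> Tuple[str, List[Tuple[str, int, int, int]], List[FrozenSet[int]]]:
    """Toy sketch: label each BWT run with the set of genomes it covers.

    Returns (bwt, runs, colors), where each run is
    (character, start offset, length, color id) and colors[color id]
    is the set of genome indices for that color.
    """
    # Concatenate the genomes, appending an assumed '$' separator to each,
    # and remember which genome owns every text position.
    parts = []
    owner = []
    for gid, genome in enumerate(genomes):
        parts.append(genome + "$")
        owner.extend([gid] * (len(genome) + 1))
    text = "".join(parts)
    n = len(text)

    # Naive suffix array: fine for a toy input, far too slow for real data.
    sa = sorted(range(n), key=lambda i: text[i:])

    # BWT character for a suffix is the character preceding it (circularly).
    bwt = "".join(text[i - 1] for i in sa)

    color_ids = {}  # frozenset of genome ids -> color id
    runs = []
    start = 0
    for pos in range(1, n + 1):
        # A run ends at the end of the BWT or where the character changes.
        if pos == n or bwt[pos] != bwt[start]:
            members = frozenset(owner[sa[j]] for j in range(start, pos))
            cid = color_ids.setdefault(members, len(color_ids))
            runs.append((bwt[start], start, pos - start, cid))
            start = pos

    # List the distinct colors in order of their assigned ids.
    colors = sorted(color_ids, key=color_ids.get)
    return bwt, runs, colors
```

On a highly repetitive input such as `color_bwt_runs(["AA", "AA"])`, both runs receive the same color (the set containing both genomes), which mirrors the abstract's point that repetitive references produce few distinct colors and so leave room for compression.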