Song Li, Langmead Ben
Department of Biomedical Data Science, Dartmouth College, Hanover, NH.
Department of Computer Science, Johns Hopkins University, Baltimore, MD.
bioRxiv. 2023 Nov 17:2023.11.15.567129. doi: 10.1101/2023.11.15.567129.
Centrifuger is an efficient taxonomic classification method that compares sequencing reads against a microbial genome database. In Centrifuger, the Burrows-Wheeler transformed genome sequences are losslessly compressed using a novel scheme called run-block compression. Run-block compression achieves sublinear space complexity and is effective at compressing diverse microbial databases like RefSeq while supporting fast rank queries. Combining this compression method with other strategies for compacting the Ferragina-Manzini (FM) index, Centrifuger reduces the memory footprint by half compared to other FM-index-based approaches. Furthermore, the lossless compression and the unconstrained match length help Centrifuger achieve greater accuracy than competing methods at lower taxonomic levels.
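The abstract does not spell out how run-block compression works, so the following is only a toy illustration of the general idea, not Centrifuger's implementation. It assumes the transformed sequence is cut into fixed-size blocks, that a block consisting of one repeated symbol is stored as that single symbol, and that per-block running counts answer rank queries. The class name `RunBlockSketch`, the `block_size` parameter, and the per-block count tables are all hypothetical choices made for this sketch. A real index would use far more compact auxiliary structures.

```python
class RunBlockSketch:
    """Toy rank structure (hypothetical, for illustration only).

    Full blocks holding one repeated symbol are stored as a single
    character; every other block is stored verbatim.
    """

    def __init__(self, bwt: str, block_size: int = 4):
        if block_size <= 0:
            raise ValueError("block_size must be positive")
        self.n = len(bwt)
        self.block_size = block_size
        self.blocks = []         # one symbol for a run block, raw text otherwise
        self.is_run = []         # True where the block is a single-symbol run
        self.prefix_counts = []  # symbol counts seen before each block starts
        counts = {}
        for start in range(0, self.n, block_size):
            chunk = bwt[start:start + block_size]
            self.prefix_counts.append(dict(counts))
            # Only full-length blocks are treated as runs, to keep the sketch simple.
            run = len(chunk) == block_size and len(set(chunk)) == 1
            self.is_run.append(run)
            self.blocks.append(chunk[0] if run else chunk)
            for ch in chunk:
                counts[ch] = counts.get(ch, 0) + 1

    def rank(self, symbol: str, i: int) -> int:
        """Return the number of occurrences of `symbol` in bwt[:i]."""
        if not 0 <= i <= self.n:
            raise IndexError("position out of range")
        if i == 0:
            return 0
        # Clamp to the last block so that i == n is handled uniformly.
        b = min(i // self.block_size, len(self.blocks) - 1)
        offset = i - b * self.block_size
        base = self.prefix_counts[b].get(symbol, 0)
        if self.is_run[b]:
            # A run block contributes `offset` copies of its single symbol.
            return base + (offset if self.blocks[b] == symbol else 0)
        return base + self.blocks[b][:offset].count(symbol)

    def stored_symbols(self) -> int:
        """Characters actually kept across all blocks."""
        return sum(len(block) for block in self.blocks)


if __name__ == "__main__":
    text = "AAAACCCCACGTTTTTGG"  # made-up example string, not real data
    index = RunBlockSketch(text, block_size=4)
    print(index.is_run)
    print(index.rank("T", 14), text[:14].count("T"))
    print(index.stored_symbols(), "of", len(text), "symbols stored")
```

In this sketch the saving comes entirely from blocks that collapse to one character, which is why the approach suits Burrows-Wheeler transformed text, where long runs of a single symbol are common. The lossless property is preserved because non-run blocks are kept verbatim.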