Břinda Karel, Lima Leandro, Pignotti Simone, Quinones-Olvera Natalia, Salikhov Kamil, Chikhi Rayan, Kucherov Gregory, Iqbal Zamin, Baym Michael
Inria, Irisa, Univ. Rennes, Rennes, France.
Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
Nat Methods. 2025 Apr;22(4):692-697. doi: 10.1038/s41592-025-02625-2. Epub 2025 Apr 9.
Comprehensive collections approaching millions of sequenced genomes have become central information sources in the life sciences. However, the rapid growth of these collections has made it effectively impossible to search these data using tools such as the Basic Local Alignment Search Tool (BLAST) and its successors. Here, we present a technique called phylogenetic compression, which uses evolutionary history to guide compression and efficiently search large collections of microbial genomes using existing algorithms and data structures. We show that, when applied to modern diverse collections approaching millions of genomes, lossless phylogenetic compression improves the compression ratios of assemblies, de Bruijn graphs and k-mer indexes by one to two orders of magnitude. Additionally, we develop a pipeline for a BLAST-like search over these phylogeny-compressed reference data, and demonstrate it can align genes, plasmids or entire sequencing experiments against all sequenced bacteria until 2019 on ordinary desktop computers within a few hours. Phylogenetic compression has broad applications in computational biology and may provide a fundamental design principle for future genomics infrastructure.
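The core idea in the abstract, letting evolutionary relatedness decide which genomes sit next to each other before a standard compressor runs, can be illustrated with a toy experiment. The sketch below is not the authors' pipeline. It assumes simulated genomes whose clade membership is known by construction (instead of an estimated phylogeny), and it uses Python's zlib, whose match window is limited to 32 KiB, as a stand-in for a general-purpose compressor. The function names and parameters are hypothetical and chosen only for illustration.

```python
import random
import zlib

ALPHABET = "ACGT"


def random_genome(length, rng):
    """Return a random DNA string that serves as a clade ancestor."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def mutate(genome, rate, rng):
    """Redraw each base at random with probability `rate`."""
    return "".join(
        rng.choice(ALPHABET) if rng.random() < rate else base
        for base in genome
    )


def compressed_size(genomes):
    """Size in bytes of the zlib-compressed concatenation of the genomes."""
    return len(zlib.compress("".join(genomes).encode("ascii"), 9))


def simulate(num_clades=4, per_clade=4, length=20_000, rate=0.01, seed=0):
    """Compare a clade-grouped genome order with an interleaved one.

    Assumption: each clade is a set of lightly mutated copies of one
    random ancestor, so relatedness is known without building a tree.
    """
    rng = random.Random(seed)
    clades = []
    for _ in range(num_clades):
        ancestor = random_genome(length, rng)
        clades.append([mutate(ancestor, rate, rng) for _ in range(per_clade)])

    # Grouped order: relatives are adjacent, so a sibling genome starts
    # 20,000 bytes back, inside zlib's 32 KiB match window.
    grouped = [genome for clade in clades for genome in clade]

    # Interleaved order: relatives are separated by genomes from the other
    # clades (80,000 bytes apart here), outside the match window.
    interleaved = [clade[i] for i in range(per_clade) for clade in clades]

    return compressed_size(grouped), compressed_size(interleaved)


if __name__ == "__main__":
    grouped_size, interleaved_size = simulate()
    print("grouped by clade:", grouped_size, "bytes")
    print("interleaved:     ", interleaved_size, "bytes")
```

Both orders contain exactly the same genomes, so any difference in compressed size comes from the ordering alone. In the grouped order the compressor can encode each genome mostly as references to a close relative, which is the effect that phylogenetic compression exploits at scale.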